library(dplyr)
library(ggplot2)
library(broom)
library(janitor)
library(renv)
library(purrr)
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)
library(biclust)
library(cluster)
library(igraph)
library(fpc)
library(magrittr)
library(rmarkdown)
library(textreuse)
library(slam)
library(plotly)
library(htmltools)
library(klaR)
library(tidyr)
library(stringr)
Text similarity is one of the core tasks in Natural Language Processing: it compares one piece of text with another and quantifies how close they are.
For this purpose, we chose a dataset of speeches by American presidents. We first converted the data to a dataframe. We then preprocessed it, making the documents uniform by removing punctuation and numbers and converting the text to lowercase, and continued with basic NLP steps such as stop-word removal, tokenization, and lemmatization.
Next, we used the “tm” library to analyze term similarity. We then split the speeches into separate documents and used the “textreuse” library to measure document similarity. The results are visualized at each stage.
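To make the idea of document similarity concrete, here is a minimal base-R sketch of Jaccard similarity on word sets. It is an illustration only: the helper names `word_set` and `jaccard` and the two sample sentences are assumptions made for this example, and it is not the implementation used by “textreuse”.

```r
# Jaccard similarity between the word sets of two texts:
# (number of shared words) / (number of distinct words overall),
# ranging from 0 (nothing shared) to 1 (identical word sets).

# Hypothetical helper: lowercase, split on non-letters, keep unique words
word_set <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  unique(words[nzchar(words)])
}

# Hypothetical helper: Jaccard coefficient of two word sets
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

s1 <- word_set("We the people of the United States")
s2 <- word_set("The people of the United Kingdom")

# 4 shared words out of 7 distinct words, i.e. 4/7
jaccard(s1, s2)
```

Working on word sets ignores word order and repetition, which keeps the measure simple but makes it sensitive to vocabulary size; this is one reason stop words are usually removed first.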
The dataset used for this project consists of presidential inaugural addresses from the archive at presidency.ucsb.edu.
Using the following Python script, we first created a dataframe of the website’s speeches:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scrapes transcripts for inaugural addresses
def get_urls(url):
    '''Returns list of transcript urls'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    url_table = soup.find("table", class_='table').find_all("a")
    return [u["href"] for u in url_table]

urls = get_urls("https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/inaugural-addresses")

def get_transcripts(urls):
    '''Returns a dataframe with one row (president, year, content) per url'''
    records = []
    for u in urls:
        page = requests.get(u).text
        soup = BeautifulSoup(page, 'lxml')
        t_president = soup.find("h3", class_="diet-title").text
        # The date reads like "April 30, 1789"; keep only the year
        t_year = soup.find("span", class_="date-display-single").text.split(',')[1].strip()
        t_content = soup.find("div", class_="field-docs-content").text
        records.append({
            'president': t_president,
            'year': t_year,
            'content': t_content
        })
    # DataFrame.append was removed in pandas 2.0, so the frame is built from a list of records
    return pd.DataFrame(records)

data = get_transcripts(urls)
data.to_csv("us_presidents_transcripts.csv", sep="|")
In what follows, we load the dataframe:
df <- read.csv("https://raw.githubusercontent.com/berserkhmdvhb/MADS-NLP/main/data/presidents-speech.csv")
df |> dplyr::glimpse()
## Rows: 59
## Columns: 4
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ president <chr> "George Washington", "George Washington", "John Adams", "Tho…
## $ year <int> 1789, 1793, 1797, 1801, 1805, 1809, 1813, 1817, 1821, 1825, …
## $ content <chr> "\nFellow-Citizens of the Senate and of the House of Represe…
Our dataframe has 4 columns: X, the row index; president, the name of the president; year, the year in which the speech was given; and content, the full text of the speech.
Below we check some details about the dataframe. It has 59 records; the earliest speech dates from 1789 and the latest from 2021.
df |> summary()
##        X          president              year        content
##  Min.   : 0.0   Length:59          Min.   :1789   Length:59
##  1st Qu.:14.5   Class :character   1st Qu.:1847   Class :character
##  Median :29.0   Mode  :character   Median :1905   Mode  :character
##  Mean   :29.0                      Mean   :1905
##  3rd Qu.:43.5                      3rd Qu.:1963
##  Max.   :58.0                      Max.   :2021
In what follows, a text file is generated from each row of the dataframe and stored in the “texts” folder:
#presidents <- df[["president"]] |> unique() |> as.list()
# Make sure the output folder exists before writing to it
dir.create("./data/texts", recursive = TRUE, showWarnings = FALSE)
for (i in seq_len(nrow(df))) { # loop over rows
  df_i <- df[i, ]
  name <- df_i$president
  year <- df_i$year
  text <- df_i$content |> stringr::str_trim()
  # File name pattern: "<year>-<president>.txt"
  file_name <- paste(as.character(year),
                     as.character(name),
                     sep = "-")
  file_name <- paste(file_name, ".txt",
                     sep = "")
  loc <- paste("./data/texts/", file_name, sep = "")
  writeLines(text, loc)
}
# Build a corpus with one document per text file
loc <- "./data/texts/"
docs <- tm::VCorpus(DirSource(loc))
summary(docs)
## Length Class Mode
## 1789-George Washington.txt 2 PlainTextDocument list
## 1793-George Washington.txt 2 PlainTextDocument list
## 1797-John Adams.txt 2 PlainTextDocument list
## 1801-Thomas Jefferson.txt 2 PlainTextDocument list
## 1805-Thomas Jefferson.txt 2 PlainTextDocument list
## 1809-James Madison.txt 2 PlainTextDocument list
## 1813-James Madison.txt 2 PlainTextDocument list
## 1817-James Monroe.txt 2 PlainTextDocument list
## 1821-James Monroe.txt 2 PlainTextDocument list
## 1825-John Quincy Adams.txt 2 PlainTextDocument list
## 1829-Andrew Jackson.txt 2 PlainTextDocument list
## 1833-Andrew Jackson.txt 2 PlainTextDocument list
## 1837-Martin van Buren.txt 2 PlainTextDocument list
## 1841-William Henry Harrison.txt 2 PlainTextDocument list
## 1845-James K. Polk.txt 2 PlainTextDocument list
## 1849-Zachary Taylor.txt 2 PlainTextDocument list
## 1853-Franklin Pierce.txt 2 PlainTextDocument list
## 1857-James Buchanan.txt 2 PlainTextDocument list
## 1861-Abraham Lincoln.txt 2 PlainTextDocument list
## 1865-Abraham Lincoln.txt 2 PlainTextDocument list
## 1869-Ulysses S. Grant.txt 2 PlainTextDocument list
## 1873-Ulysses S. Grant.txt 2 PlainTextDocument list
## 1877-Rutherford B. Hayes.txt 2 PlainTextDocument list
## 1881-James A. Garfield.txt 2 PlainTextDocument list
## 1885-Grover Cleveland.txt 2 PlainTextDocument list
## 1889-Benjamin Harrison.txt 2 PlainTextDocument list
## 1893-Grover Cleveland.txt 2 PlainTextDocument list
## 1897-William McKinley.txt 2 PlainTextDocument list
## 1901-William McKinley.txt 2 PlainTextDocument list
## 1905-Theodore Roosevelt.txt 2 PlainTextDocument list
## 1909-William Howard Taft.txt 2 PlainTextDocument list
## 1913-Woodrow Wilson.txt 2 PlainTextDocument list
## 1917-Woodrow Wilson.txt 2 PlainTextDocument list
## 1921-Warren G. Harding.txt 2 PlainTextDocument list
## 1925-Calvin Coolidge.txt 2 PlainTextDocument list
## 1929-Herbert Hoover.txt 2 PlainTextDocument list
## 1933-Franklin D. Roosevelt.txt 2 PlainTextDocument list
## 1937-Franklin D. Roosevelt.txt 2 PlainTextDocument list
## 1941-Franklin D. Roosevelt.txt 2 PlainTextDocument list
## 1945-Franklin D. Roosevelt.txt 2 PlainTextDocument list
## 1949-Harry S. Truman.txt 2 PlainTextDocument list
## 1953-Dwight D. Eisenhower.txt 2 PlainTextDocument list
## 1957-Dwight D. Eisenhower.txt 2 PlainTextDocument list
## 1961-John F. Kennedy.txt 2 PlainTextDocument list
## 1965-Lyndon B. Johnson.txt 2 PlainTextDocument list
## 1969-Richard Nixon.txt 2 PlainTextDocument list
## 1973-Richard Nixon.txt 2 PlainTextDocument list
## 1977-Jimmy Carter.txt 2 PlainTextDocument list
## 1981-Ronald Reagan.txt 2 PlainTextDocument list
## 1985-Ronald Reagan.txt 2 PlainTextDocument list
## 1989-George Bush.txt 2 PlainTextDocument list
## 1993-William J. Clinton.txt 2 PlainTextDocument list
## 1997-William J. Clinton.txt 2 PlainTextDocument list
## 2001-George W. Bush.txt 2 PlainTextDocument list
## 2005-George W. Bush.txt 2 PlainTextDocument list
## 2009-Barack Obama.txt 2 PlainTextDocument list
## 2013-Barack Obama.txt 2 PlainTextDocument list
## 2017-Donald J. Trump.txt 2 PlainTextDocument list
## 2021-Joseph R. Biden.txt 2 PlainTextDocument list
inspect(docs[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 8617
Here we check the content of the first document, which should be the speech given by George Washington in 1789. We will use this document as a demonstration in the preprocessing part.
writeLines(as.character(docs[1]))
## list(list(content = c("Fellow-Citizens of the Senate and of the House of Representatives:", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead [see APP note] me, and its consequences be judged by my country with some share of the partiality in which they originated.",
## "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow-citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence.",
## "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people.",
## "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted.",
## "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require.",
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend."
## ), meta = list(author = character(0), datetimestamp = list(sec = 41.346049785614, min = 16, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = "1789-George Washington.txt", language = "en", origin = character(0))))
## list()
## list()
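The printout above is a deparsed list rather than readable text because docs[1] (single brackets) returns a one-document corpus, not the document itself. A minimal sketch of the difference, assuming the tm package is installed and using a toy corpus (the object `toy` and its text are made up for this example):

```r
library(tm)

# Toy two-document corpus (hypothetical text, not the speeches)
toy <- VCorpus(VectorSource(c("First speech.", "Another speech.")))

# Single brackets: still a corpus, holding one document
class(toy[1])

# Double brackets: the PlainTextDocument itself
class(toy[[1]])

# as.character() on the document returns its text,
# so writeLines() prints plain lines instead of a deparsed list
writeLines(as.character(toy[[1]]))
```

The same distinction applies to the speeches: indexing with double brackets would print the text paragraph by paragraph, without the surrounding list and metadata.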
This project investigates text similarity between speeches given by different US presidents, from 1789 to 2021.
In the Preprocessing section, several text mining tasks are applied to all the documents.
In the Term Similarity section, the frequency of different terms in the documents is analyzed and visualized.
In the Doc Similarity section, similarity between documents is measured, analyzed, and visualized.
In the Conclusion, the main findings are summarized.
The GitHub repository for this project can be found at this link.
The tm package is a framework for text mining applications within R. Most functions used henceforth come from this package.
Removing punctuation ensures that tokens differing only in punctuation are treated as the same term. For example, data and data! are treated as the same word once punctuation is removed. After the removal, we print the content of the first document again to check the result. In the printout, the paragraphs are still separated by “,” and enclosed in quotes, because that is how the list is displayed, but the punctuation inside the quotes is gone.
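The data / data! example can be reproduced on a toy vector with base R. This is a sketch under an assumption: the helper `strip_punct` is hypothetical and only approximates the default behaviour of removePunctuation for ASCII punctuation.

```r
# Hypothetical helper: drop every run of ASCII punctuation characters
# (assumed to approximate tm::removePunctuation with default arguments)
strip_punct <- function(x) gsub("[[:punct:]]+", "", x)

tokens <- c("data", "data!", "Fellow-Citizens")
cleaned <- strip_punct(tokens)

# "data" and "data!" now compare equal
identical(cleaned[1], cleaned[2])

# Hyphenated words are joined rather than split
cleaned[3]
```

The last element shows a side effect that is also visible in the printout of the first speech: removing the hyphen turns Fellow-Citizens into the single token FellowCitizens. If the two words should stay separate, hyphens would have to be replaced with a space before punctuation is stripped.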
docs <- tm::tm_map(docs,removePunctuation)
writeLines(as.character(docs[1]))
## list(list(content = c("FellowCitizens of the Senate and of the House of Representatives", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the 14th day of the present month On the one hand I was summoned by my country whose voice I can never hear but with veneration and love from a retreat which I had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time On the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected All I dare hope is that if in executing this task I have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see APP note me and its consequences be judged by my country with some share of the partiality in which they originated",
## "Such being the impressions under which I have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge In tendering this homage to the Great Author of every public and private good I assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage These reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed You will join with me I trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence",
## "By the article establishing the executive department it is made the duty of the President to recommend to your consideration such measures as he shall judge necessary and expedient The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given It will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world I dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the American people",
## "Besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them Instead of undertaking particular recommendations on this subject in which I could be guided by no lights derived from official opportunities I shall again give way to my entire confidence in your discernment and pursuit of the public good for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted",
## "To the foregoing observations I have one to add which will be most properly addressed to the House of Representatives It concerns myself and will therefore be as brief as possible When I was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which I contemplated my duty required that I should renounce every pecuniary compensation From this resolution I have in no instance departed and being still under the impressions which produced it I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require",
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together I shall take my present leave but not without resorting once more to the benign Parent of the Human Race in humble supplication that since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so His divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this Government must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 41.346049785614, min = 16, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = "1789-George Washington.txt", language = "en", origin = character(0))))
## list()
## list()
Next, we remove a few special characters. We use gsub to replace each of the characters listed below with a space, then print the first document again to check the result.
for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
  docs[[j]] <- gsub("\u2028", " ", docs[[j]]) # U+2028 is the Unicode line separator, which did not render properly, so it is replaced as well.
}
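The loop above assigns the result of gsub directly to docs[[j]], which replaces each PlainTextDocument with a plain character vector; this is why the next printout begins with list(c(...) instead of list(list(content = ...) and no longer shows the document metadata. A possible alternative, sketched here on a toy corpus and assuming the tm package is installed, wraps the replacement in content_transformer so that each element stays a document. The names `toy` and `to_space` and the sample strings are made up for this example.

```r
library(tm)

# Toy corpus standing in for the speeches (hypothetical text)
toy <- VCorpus(VectorSource(c("checks/balances @home", "war|peace")))

# content_transformer() turns a plain string function into one that
# modifies the content of a document and returns the document itself
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Apply the same four replacements as the loop above
for (p in c("/", "@", "\\|", "\u2028")) {
  toy <- tm_map(toy, to_space, p)
}

class(toy[[1]])    # still a PlainTextDocument
content(toy[[1]])  # text with the special characters replaced by spaces
meta(toy[[1]])     # document metadata is still attached
```

Keeping the documents intact matters mainly for later steps that rely on document metadata such as the id; if only the raw text is needed, the loop shown above gives the same character content.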
writeLines(as.character(docs[1]))
## list(c("FellowCitizens of the Senate and of the House of Representatives", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the 14th day of the present month On the one hand I was summoned by my country whose voice I can never hear but with veneration and love from a retreat which I had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time On the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected All I dare hope is that if in executing this task I have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see APP note me and its consequences be judged by my country with some share of the partiality in which they originated",
## "Such being the impressions under which I have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge In tendering this homage to the Great Author of every public and private good I assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage These reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed You will join with me I trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence",
## "By the article establishing the executive department it is made the duty of the President to recommend to your consideration such measures as he shall judge necessary and expedient The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given It will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world I dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the American people",
## "Besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them Instead of undertaking particular recommendations on this subject in which I could be guided by no lights derived from official opportunities I shall again give way to my entire confidence in your discernment and pursuit of the public good for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted",
## "To the foregoing observations I have one to add which will be most properly addressed to the House of Representatives It concerns myself and will therefore be as brief as possible When I was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which I contemplated my duty required that I should renounce every pecuniary compensation From this resolution I have in no instance departed and being still under the impressions which produced it I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require",
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together I shall take my present leave but not without resorting once more to the benign Parent of the Human Race in humble supplication that since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so His divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this Government must depend"
## ))
## list()
## list()
In this step, to make the text more uniform, we remove all digits. The tm library provides the removeNumbers function for this purpose. Only the digits themselves are removed, so an ordinal loses its number but keeps its suffix, as in "the th day" in the output below.
docs <- tm::tm_map(docs, removeNumbers)
writeLines(as.character(docs[1]))
## list(c("FellowCitizens of the Senate and of the House of Representatives", "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the th day of the present month On the one hand I was summoned by my country whose voice I can never hear but with veneration and love from a retreat which I had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time On the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected All I dare hope is that if in executing this task I have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see APP note me and its consequences be judged by my country with some share of the partiality in which they originated",
## "Such being the impressions under which I have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge In tendering this homage to the Great Author of every public and private good I assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage These reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed You will join with me I trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence",
## "By the article establishing the executive department it is made the duty of the President to recommend to your consideration such measures as he shall judge necessary and expedient The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given It will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world I dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the American people",
## "Besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them Instead of undertaking particular recommendations on this subject in which I could be guided by no lights derived from official opportunities I shall again give way to my entire confidence in your discernment and pursuit of the public good for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted",
## "To the foregoing observations I have one to add which will be most properly addressed to the House of Representatives It concerns myself and will therefore be as brief as possible When I was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which I contemplated my duty required that I should renounce every pecuniary compensation From this resolution I have in no instance departed and being still under the impressions which produced it I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require",
## "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together I shall take my present leave but not without resorting once more to the benign Parent of the Human Race in humble supplication that since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so His divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this Government must depend"
## ))
## list()
## list()
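The effect of removeNumbers is easiest to see on a single string. The following is a minimal sketch, assuming the tm package is installed; the sample sentence is an illustrative stand-in rather than a line copied from the corpus.

```r
library(tm)

# Illustrative input (an assumption, not copied from the dataset).
sample_text <- "received on the 14th day of the present month"

# removeNumbers() deletes digit characters only; letters such as the
# ordinal suffix "th" are left in place.
no_digits <- removeNumbers(sample_text)
print(no_digits)
```

This is why a fragment such as "the th day" appears in the output above: the digits are gone, but the suffix that followed them is not.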
Again for the sake of uniformity, we convert all uppercase letters to lowercase. Words like “Book” and “book” mean the same thing, but without lowercasing they are represented as two different terms in the vector space model, which adds dimensions.
Checking the first document below, we see that the first word of the speech, “fellowcitizens”, now starts with a lowercase letter.
docs <- tm::tm_map(docs, tolower)
docs <- tm::tm_map(docs, PlainTextDocument)
DocsCopy <- docs
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizens of the senate and of the house of representatives", "among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order and received on the th day of the present month on the one hand i was summoned by my country whose voice i can never hear but with veneration and love from a retreat which i had chosen with the fondest predilection and in my flattering hopes with an immutable decision as the asylum of my declining years—a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination and of frequent interruptions in my health to the gradual waste committed on it by time on the other hand the magnitude and difficulty of the trust to which the voice of my country called me being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications could not but overwhelm with despondence one who inheriting inferior endowments from nature and unpracticed in the duties of civil administration ought to be peculiarly conscious of his own deficiencies in this conflict of emotions all i dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected all i dare hope is that if in executing this task i have been too much swayed by a grateful remembrance of former instances or by an affectionate sensibility to this transcendent proof of the confidence of my fellowcitizens and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me my error will be palliated by the motives which mislead see app note me and its consequences be judged by my country with some share of the partiality in which they originated",
## "such being the impressions under which i have in obedience to the public summons repaired to the present station it would be peculiarly improper to omit in this first official act my fervent supplications to that almighty being who rules over the universe who presides in the councils of nations and whose providential aids can supply every human defect that his benediction may consecrate to the liberties and happiness of the people of the united states a government instituted by themselves for these essential purposes and may enable every instrument employed in its administration to execute with success the functions allotted to his charge in tendering this homage to the great author of every public and private good i assure myself that it expresses your sentiments not less than my own nor those of my fellowcitizens at large less than either no people can be bound to acknowledge and adore the invisible hand which conducts the affairs of men more than those of the united states every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude along with an humble anticipation of the future blessings which the past seem to presage these reflections arising out of the present crisis have forced themselves too strongly on my mind to be suppressed you will join with me i trust in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence",
## "by the article establishing the executive department it is made the duty of the president to recommend to your consideration such measures as he shall judge necessary and expedient the circumstances under which i now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled and which in defining your powers designates the objects to which your attention is to be given it will be more consistent with those circumstances and far more congenial with the feelings which actuate me to substitute in place of a recommendation of particular measures the tribute that is due to the talents the rectitude and the patriotism which adorn the characters selected to devise and adopt them in these honorable qualifications i behold the surest pledges that as on one side no local prejudices or attachments no separate views nor party animosities will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests so on another that the foundation of our national policy will be laid in the pure and immutable principles of private morality and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world i dwell on this prospect with every satisfaction which an ardent love for my country can inspire since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness between duty and advantage between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity since we ought to be no less persuaded that the propitious smiles of heaven can never be expected on a nation that disregards the eternal rules of order and right which heaven itself has ordained and since the preservation of the sacred fire of liberty and 
the destiny of the republican model of government are justly considered perhaps as deeply as finally staked on the experiment entrusted to the hands of the american people",
## "besides the ordinary objects submitted to your care it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system or by the degree of inquietude which has given birth to them instead of undertaking particular recommendations on this subject in which i could be guided by no lights derived from official opportunities i shall again give way to my entire confidence in your discernment and pursuit of the public good for i assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government or which ought to await the future lessons of experience a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted",
## "to the foregoing observations i have one to add which will be most properly addressed to the house of representatives it concerns myself and will therefore be as brief as possible when i was first honored with a call into the service of my country then on the eve of an arduous struggle for its liberties the light in which i contemplated my duty required that i should renounce every pecuniary compensation from this resolution i have in no instance departed and being still under the impressions which produced it i must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department and must accordingly pray that the pecuniary estimates for the station in which i am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require",
## "having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together i shall take my present leave but not without resorting once more to the benign parent of the human race in humble supplication that since he has been pleased to favor the american people with opportunities for deliberating in perfect tranquillity and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness so his divine blessing may be equally conspicuous in the enlarged views the temperate consultations and the wise measures on which the success of this government must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 41.4730560779572, min = 16, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
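Recent versions of tm generally expect plain character functions such as tolower to be wrapped in content_transformer(), so that each element stays a text document; the pipeline above instead re-wraps the result with PlainTextDocument. A minimal sketch of the wrapper approach on a toy corpus follows, assuming tm is installed (the name toy and the two sample strings are assumptions made for illustration):

```r
library(tm)

# Toy corpus; the sample strings are illustrative, not taken from the speeches.
toy <- VCorpus(VectorSource(c("Book and book", "FellowCitizens of the Senate")))

# content_transformer() applies tolower() to each document's text and
# returns a text document again, so no re-wrapping step is needed.
toy <- tm_map(toy, content_transformer(tolower))

lowered <- c(content(toy[[1]]), content(toy[[2]]))
print(lowered)
```

With this variant, “Book” and “book” collapse into the same term while the corpus structure stays intact.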
Stop words such as “the”, “of”, and “and” are abundant in any human language. Removing them strips out low-information words so that the analysis can focus on the terms that carry meaning.
# Number of English stop words shipped with tm
# (run stopwords("english") to see the full list):
length(stopwords("english"))
## [1] 174
docs <- tm::tm_map(docs, removeWords, stopwords("english"))
docs <- tm::tm_map(docs, PlainTextDocument)
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizens senate house representatives", "among vicissitudes incident life event filled greater anxieties notification transmitted order received th day present month one hand summoned country whose voice can never hear veneration love retreat chosen fondest predilection flattering hopes immutable decision asylum declining years— retreat rendered every day necessary well dear addition habit inclination frequent interruptions health gradual waste committed time hand magnitude difficulty trust voice country called sufficient awaken wisest experienced citizens distrustful scrutiny qualifications overwhelm despondence one inheriting inferior endowments nature unpracticed duties civil administration peculiarly conscious deficiencies conflict emotions dare aver faithful study collect duty just appreciation every circumstance might affected dare hope executing task much swayed grateful remembrance former instances affectionate sensibility transcendent proof confidence fellowcitizens thence little consulted incapacity well disinclination weighty untried cares error will palliated motives mislead see app note consequences judged country share partiality originated",
## " impressions obedience public summons repaired present station peculiarly improper omit first official act fervent supplications almighty rules universe presides councils nations whose providential aids can supply every human defect benediction may consecrate liberties happiness people united states government instituted essential purposes may enable every instrument employed administration execute success functions allotted charge tendering homage great author every public private good assure expresses sentiments less fellowcitizens large less either people can bound acknowledge adore invisible hand conducts affairs men united states every step advanced character independent nation seems distinguished token providential agency important revolution just accomplished system united government tranquil deliberations voluntary consent many distinct communities event resulted can compared means governments established without return pious gratitude along humble anticipation future blessings past seem presage reflections arising present crisis forced strongly mind suppressed will join trust thinking none influence proceedings new free government can auspiciously commence",
## " article establishing executive department made duty president recommend consideration measures shall judge necessary expedient circumstances now meet will acquit entering subject refer great constitutional charter assembled defining powers designates objects attention given will consistent circumstances far congenial feelings actuate substitute place recommendation particular measures tribute due talents rectitude patriotism adorn characters selected devise adopt honorable qualifications behold surest pledges one side local prejudices attachments separate views party animosities will misdirect comprehensive equal eye watch great assemblage communities interests another foundation national policy will laid pure immutable principles private morality preeminence free government exemplified attributes can win affections citizens command respect world dwell prospect every satisfaction ardent love country can inspire since truth thoroughly established exists economy course nature indissoluble union virtue happiness duty advantage genuine maxims honest magnanimous policy solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expected nation disregards eternal rules order right heaven ordained since preservation sacred fire liberty destiny republican model government justly considered perhaps deeply finally staked experiment entrusted hands american people",
## "besides ordinary objects submitted care will remain judgment decide far exercise occasional power delegated fifth article constitution rendered expedient present juncture nature objections urged system degree inquietude given birth instead undertaking particular recommendations subject guided lights derived official opportunities shall give way entire confidence discernment pursuit public good assure whilst carefully avoid every alteration might endanger benefits united effective government await future lessons experience reverence characteristic rights freemen regard public harmony will sufficiently influence deliberations question far former can impregnably fortified latter safely advantageously promoted",
## " foregoing observations one add will properly addressed house representatives concerns will therefore brief possible first honored call service country eve arduous struggle liberties light contemplated duty required renounce every pecuniary compensation resolution instance departed still impressions produced must decline inapplicable share personal emoluments may indispensably included permanent provision executive department must accordingly pray pecuniary estimates station placed may continuance limited actual expenditures public good may thought require",
## " thus imparted sentiments awakened occasion brings us together shall take present leave without resorting benign parent human race humble supplication since pleased favor american people opportunities deliberating perfect tranquillity dispositions deciding unparalleled unanimity form government security union advancement happiness divine blessing may equally conspicuous enlarged views temperate consultations wise measures success government must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 41.6906197071075, min = 16, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
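To see what stop word removal does to a single phrase, the opening words of the first speech can be processed on their own. This is a minimal sketch, assuming tm is installed; the variable names are chosen for illustration.

```r
library(tm)

# Opening words of the first speech, already lowercased as in the step above.
sample_text <- "among the vicissitudes incident to life"

# removeWords() blanks out whole-word matches, leaving extra spaces behind.
without_stops <- removeWords(sample_text, stopwords("english"))

# stripWhitespace() collapses each run of whitespace into a single space.
cleaned <- stripWhitespace(without_stops)
print(cleaned)
```

The words "the" and "to" are dropped while "among" is kept, which matches the start of the second element in the output above. The gaps left behind by the removed words are the reason the pipeline applies stripWhitespace in a later step.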
# docs <- tm::tm_map(docs, removeWords, c("syllogism", "tautology"))
# The commented-out line above would remove just the words "syllogism" and "tautology".
# These words don't actually occur in these texts, but this is how you would remove them if they did.
If you wish to preserve a concept that only appears as a sequence of two or more words, you can combine the words or reduce them to a meaningful acronym before you begin the analysis. Here, we use examples that are particular to qualitative data analysis.
# Join multi-word concepts into single tokens before further analysis.
for (j in seq_along(docs)) {
  docs[[j]] <- gsub("fake news", "fake_news", docs[[j]])
  docs[[j]] <- gsub("inner city", "inner-city", docs[[j]])
  docs[[j]] <- gsub("politically correct", "politically_correct", docs[[j]])
}
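The same substitution can also be expressed as a tm transformation, which keeps each element a text document instead of a bare character vector. The following is a minimal sketch, assuming tm is installed; combine_phrase and the toy sentence are hypothetical names and data introduced only for illustration.

```r
library(tm)

# Hypothetical helper: replace a fixed phrase inside each document's text.
combine_phrase <- content_transformer(function(x, phrase, token) {
  gsub(phrase, token, x, fixed = TRUE)
})

# Toy corpus with an illustrative sentence (not from the speeches).
toy <- VCorpus(VectorSource("fake news about the inner city"))
toy <- tm_map(toy, combine_phrase, "fake news", "fake_news")
toy <- tm_map(toy, combine_phrase, "inner city", "inner-city")

combined <- content(toy[[1]])
print(combined)
```

Passing fixed = TRUE treats the phrase as a literal string rather than a regular expression, which is usually what is wanted when joining known multi-word terms.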
docs <- tm_map(docs, PlainTextDocument)
docs <- tm_map(docs, stripWhitespace)
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizens senate house representatives", "among vicissitudes incident life event filled greater anxieties notification transmitted order received th day present month one hand summoned country whose voice can never hear veneration love retreat chosen fondest predilection flattering hopes immutable decision asylum declining years— retreat rendered every day necessary well dear addition habit inclination frequent interruptions health gradual waste committed time hand magnitude difficulty trust voice country called sufficient awaken wisest experienced citizens distrustful scrutiny qualifications overwhelm despondence one inheriting inferior endowments nature unpracticed duties civil administration peculiarly conscious deficiencies conflict emotions dare aver faithful study collect duty just appreciation every circumstance might affected dare hope executing task much swayed grateful remembrance former instances affectionate sensibility transcendent proof confidence fellowcitizens thence little consulted incapacity well disinclination weighty untried cares error will palliated motives mislead see app note consequences judged country share partiality originated",
## " impressions obedience public summons repaired present station peculiarly improper omit first official act fervent supplications almighty rules universe presides councils nations whose providential aids can supply every human defect benediction may consecrate liberties happiness people united states government instituted essential purposes may enable every instrument employed administration execute success functions allotted charge tendering homage great author every public private good assure expresses sentiments less fellowcitizens large less either people can bound acknowledge adore invisible hand conducts affairs men united states every step advanced character independent nation seems distinguished token providential agency important revolution just accomplished system united government tranquil deliberations voluntary consent many distinct communities event resulted can compared means governments established without return pious gratitude along humble anticipation future blessings past seem presage reflections arising present crisis forced strongly mind suppressed will join trust thinking none influence proceedings new free government can auspiciously commence",
## " article establishing executive department made duty president recommend consideration measures shall judge necessary expedient circumstances now meet will acquit entering subject refer great constitutional charter assembled defining powers designates objects attention given will consistent circumstances far congenial feelings actuate substitute place recommendation particular measures tribute due talents rectitude patriotism adorn characters selected devise adopt honorable qualifications behold surest pledges one side local prejudices attachments separate views party animosities will misdirect comprehensive equal eye watch great assemblage communities interests another foundation national policy will laid pure immutable principles private morality preeminence free government exemplified attributes can win affections citizens command respect world dwell prospect every satisfaction ardent love country can inspire since truth thoroughly established exists economy course nature indissoluble union virtue happiness duty advantage genuine maxims honest magnanimous policy solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expected nation disregards eternal rules order right heaven ordained since preservation sacred fire liberty destiny republican model government justly considered perhaps deeply finally staked experiment entrusted hands american people",
## "besides ordinary objects submitted care will remain judgment decide far exercise occasional power delegated fifth article constitution rendered expedient present juncture nature objections urged system degree inquietude given birth instead undertaking particular recommendations subject guided lights derived official opportunities shall give way entire confidence discernment pursuit public good assure whilst carefully avoid every alteration might endanger benefits united effective government await future lessons experience reverence characteristic rights freemen regard public harmony will sufficiently influence deliberations question far former can impregnably fortified latter safely advantageously promoted",
## " foregoing observations one add will properly addressed house representatives concerns will therefore brief possible first honored call service country eve arduous struggle liberties light contemplated duty required renounce every pecuniary compensation resolution instance departed still impressions produced must decline inapplicable share personal emoluments may indispensably included permanent provision executive department must accordingly pray pecuniary estimates station placed may continuance limited actual expenditures public good may thought require",
## " thus imparted sentiments awakened occasion brings us together shall take present leave without resorting benign parent human race humble supplication since pleased favor american people opportunities deliberating perfect tranquillity dispositions deciding unparalleled unanimity form government security union advancement happiness divine blessing may equally conspicuous enlarged views temperate consultations wise measures success government must depend"), meta = list(author = character(0), datetimestamp = list(
## sec = 41.7397825717926, min = 16, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
docs <- tm_map(docs, PlainTextDocument)
The stemDocument function from the tm package performs stemming on the documents. However, stemming leaves some words as truncated stems rather than complete words. For instance, words built on “age”, such as “aging” and “ages”, are reduced to “ag”, and “people” becomes “peopl”. The output below shows this issue:
dictCorpus <- docs
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[1]))
## list(list(content = c("fellowcitizen senat hous repres", "among vicissitud incid life event fill greater anxieti notif transmit order receiv th day present month one hand summon countri whose voic can never hear vener love retreat chosen fondest predilect flatter hope immut decis asylum declin years— retreat render everi day necessari well dear addit habit inclin frequent interrupt health gradual wast commit time hand magnitud difficulti trust voic countri call suffici awaken wisest experienc citizen distrust scrutini qualif overwhelm despond one inherit inferior endow natur unpract duti civil administr peculiar conscious defici conflict emot dare aver faith studi collect duti just appreci everi circumst might affect dare hope execut task much sway grate remembr former instanc affection sensibl transcend proof confid fellowcitizen thenc littl consult incapac well disinclin weighti untri care error will palliat motiv mislead see app note consequ judg countri share partial origin",
## "impress obedi public summon repair present station peculiar improp omit first offici act fervent supplic almighti rule univers presid council nation whose providenti aid can suppli everi human defect benedict may consecr liberti happi peopl unit state govern institut essenti purpos may enabl everi instrument employ administr execut success function allot charg tender homag great author everi public privat good assur express sentiment less fellowcitizen larg less either peopl can bound acknowledg ador invis hand conduct affair men unit state everi step advanc charact independ nation seem distinguish token providenti agenc import revolut just accomplish system unit govern tranquil deliber voluntari consent mani distinct communiti event result can compar mean govern establish without return pious gratitud along humbl anticip futur bless past seem presag reflect aris present crisi forc strong mind suppress will join trust think none influenc proceed new free govern can auspici commenc",
## "articl establish execut depart made duti presid recommend consider measur shall judg necessari expedi circumst now meet will acquit enter subject refer great constitut charter assembl defin power design object attent given will consist circumst far congeni feel actuat substitut place recommend particular measur tribut due talent rectitud patriot adorn charact select devis adopt honor qualif behold surest pledg one side local prejudic attach separ view parti animos will misdirect comprehens equal eye watch great assemblag communiti interest anoth foundat nation polici will laid pure immut principl privat moral preemin free govern exemplifi attribut can win affect citizen command respect world dwell prospect everi satisfact ardent love countri can inspir sinc truth thorough establish exist economi cours natur indissolubl union virtu happi duti advantag genuin maxim honest magnanim polici solid reward public prosper felic sinc less persuad propiti smile heaven can never expect nation disregard etern rule order right heaven ordain sinc preserv sacr fire liberti destini republican model govern just consid perhap deepli final stake experi entrust hand american peopl",
## "besid ordinari object submit care will remain judgment decid far exercis occasion power deleg fifth articl constitut render expedi present junctur natur object urg system degre inquietud given birth instead undertak particular recommend subject guid light deriv offici opportun shall give way entir confid discern pursuit public good assur whilst care avoid everi alter might endang benefit unit effect govern await futur lesson experi rever characterist right freemen regard public harmoni will suffici influenc deliber question far former can impregn fortifi latter safe advantag promot",
## "forego observ one add will proper address hous repres concern will therefor brief possibl first honor call servic countri eve arduous struggl liberti light contempl duti requir renounc everi pecuniari compens resolut instanc depart still impress produc must declin inapplic share person emolu may indispens includ perman provis execut depart must accord pray pecuniari estim station place may continu limit actual expenditur public good may thought requir", "thus impart sentiment awaken occas bring us togeth shall take present leav without resort benign parent human race humbl supplic sinc pleas favor american peopl opportun deliber perfect tranquil disposit decid unparallel unanim form govern secur union advanc happi divin bless may equal conspicu enlarg view temper consult wise measur success govern must depend"
## ), meta = list(author = character(0), datetimestamp = list(sec = 41.7660303115845, min = 16, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
Because these stems are not real words, later analysis steps (for example, word frequency analysis) would report unrecognizable terms. To resolve this, we use stemCompletion from the tm package. While stemDocument is designed to be mapped over the whole corpus, and hence over all the documents it contains, stemCompletion operates only on a given vector of words. As the library does not provide a version of stemCompletion applicable to a whole corpus, we applied it manually to every document in the corpus using lapply. Note that stemCompletion completes words by looking them up in a source corpus that serves as a dictionary. For this purpose, we stored a copy of the corpus in dictCorpus before stemming and pass it to stemCompletion afterwards. stemCompletion has another drawback: it replaces empty strings with words that never appeared in the text. We avoid this by defining a modified version, stemCompletion_mod, that drops empty strings before completion.
stemCompletion_mod <- function(x, dictionary) {
  # Split the document text into individual words
  x <- unlist(strsplit(as.character(x), " "))
  # Drop empty strings, which stemCompletion would otherwise replace with unrelated words
  x <- x[x != ""]
  # Complete each stem using the unstemmed corpus as the dictionary
  x <- stemCompletion(x, dictionary=dictionary)
  # Reassemble the words into one string and wrap it as a text document
  x <- paste(x, sep="", collapse=" ")
  PlainTextDocument(stripWhitespace(x))
}
stemCompletion_mod(docs[[1]], dictCorpus) |> as.character() |> writeLines()
## fellowcitizens senate house representatives among vicissitudes incident life events fill greater anxieties notification transmit order receive things day present months one hand summoned countries whose voice can never heart veneration love retreat chosen fondest predilection flattered hope immutable decisions asylum decline years— retreat render everincreasing day necessarily well dear additional habits inclination frequent interrupted health gradually waste committed time hand magnitude difficulties trust voice countries called sufficient awakened wisest experience citizens distrust scrutinize qualifications overwhelming despondence one inheritance inferior endowed nature unpracticed duties civil administration peculiar consciousness deficit conflict emotions dare avert faith studied collected duties justice appreciation everincreasing circumstances might affecting dare hope executive task much swayed grateful remembrance former instance affection sensible transcendent proof confidence fellowcitizens thence little consultations incapacity well disinclination weightiest untried care error will palliated motives mislead see appear note consequences judgment countries share partial original impressed obedience public summoned repair present station peculiar improper omit first official action fervent supplications rule universal president councils nation whose providential aid can supplications everincreasing human defects benediction may consecrate liberties happiness people united states government institutions essential purpose may enable everincreasing instrument employed administration executive success functions allotted charged tender homage great authority everincreasing public private good assured expression sentiment less fellowcitizens large less either people can bound acknowledged adore invisible hand conduct affairs men united states everincreasing steps advance character 
independence nation seem distinguished token providential agencies important revolution justice accomplished system united government tranquillity deliberate voluntarily consent manifest distinction communities events result can comparative means government established without return pious gratitude along humble anticipated future blessings past seem presage reflect arise present crisis force strong mind suppression will join trust think none influence proceed new free government can auspicious commencement articles established executive departments made duties president recommend consideration measures shall judgment necessarily expedient circumstances now meet will acquit enterprise subject reference great constitution charter assembled define power designed object attention given will consistent circumstances far congenial feel actuated substitute place recommend particular measures tribute due talents rectitude patriotism adorn character selected devised adoption honor qualifications behold surest pledge one side local prejudice attachment separate view parties animosities will misdirect comprehensive equal eyes watching great assemblage communities interests another foundations nation policies will laid pure immutable principles private moral preeminent free government exemplified attributes can win affecting citizens command respect world dwell prospect everincreasing satisfaction ardent love countries can inspire since truth thorough established existence economic course nature indissoluble union virtue happiness duties advantage genuine maxim honest magnanimity policies solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expect nation disregard eternal rule order rights heaven ordained since preserve sacred fire liberties destinies republican model government justice consideration perhaps finally stake experience entrusted hand american people besides object submit care will remain judgment decide far exercise 
occasion power delegated fifth articles constitution render expedient present juncture nature object urge system degree inquietude given birth instead undertake particular recommend subject guidance light derived official opportunity shall give way entire confidence discern pursuit public good assured whilst care avoid everincreasing altered might endanger benefits united effect government await future lesson experience reverence characteristic rights freemen regard public harmonious will sufficient influence deliberate question far former can impregnable fortifications latter safety advantage promote forego observe one add will proper address house representatives concern will therefore brief possible first honor called service countries every arduous struggle liberties light contemplate duties require renounce everincreasing compensation resolution instance departments still impressed produce must decline inapplicable share personal emoluments may indispensable including permanent provision executive departments must according prayer estimate station place may continue limits actual expenditures public good may thought require thus impartial sentiment awakened occasion bring us together shall take present leave without resort benign parent human race humble supplications since pleasing favor american people opportunity deliberate perfect tranquillity disposition decide unparalleled unanimity form government secure union advance happiness divine blessings may equal conspicuous enlarged view temper consultations wise measures success government must depend
docs <- lapply(docs, stemCompletion_mod, dictionary=dictCorpus)
docs <- as.VCorpus(docs)
#docs <- tm_map(docs, PlainTextDocument)
writeLines(as.character(docs[1]))
## list(`character(0)` = list(content = "fellowcitizens senate house representatives among vicissitudes incident life events fill greater anxieties notification transmit order receive things day present months one hand summoned countries whose voice can never heart veneration love retreat chosen fondest predilection flattered hope immutable decisions asylum decline years— retreat render everincreasing day necessarily well dear additional habits inclination frequent interrupted health gradually waste committed time hand magnitude difficulties trust voice countries called sufficient awakened wisest experience citizens distrust scrutinize qualifications overwhelming despondence one inheritance inferior endowed nature unpracticed duties civil administration peculiar consciousness deficit conflict emotions dare avert faith studied collected duties justice appreciation everincreasing circumstances might affecting dare hope executive task much swayed grateful remembrance former instance affection sensible transcendent proof confidence fellowcitizens thence little consultations incapacity well disinclination weightiest untried care error will palliated motives mislead see appear note consequences judgment countries share partial original impressed obedience public summoned repair present station peculiar improper omit first official action fervent supplications rule universal president councils nation whose providential aid can supplications everincreasing human defects benediction may consecrate liberties happiness people united states government institutions essential purpose may enable everincreasing instrument employed administration executive success functions allotted charged tender homage great authority everincreasing public private good assured expression sentiment less fellowcitizens large less either people can bound acknowledged adore invisible hand conduct affairs men united states everincreasing steps advance character 
independence nation seem distinguished token providential agencies important revolution justice accomplished system united government tranquillity deliberate voluntarily consent manifest distinction communities events result can comparative means government established without return pious gratitude along humble anticipated future blessings past seem presage reflect arise present crisis force strong mind suppression will join trust think none influence proceed new free government can auspicious commencement articles established executive departments made duties president recommend consideration measures shall judgment necessarily expedient circumstances now meet will acquit enterprise subject reference great constitution charter assembled define power designed object attention given will consistent circumstances far congenial feel actuated substitute place recommend particular measures tribute due talents rectitude patriotism adorn character selected devised adoption honor qualifications behold surest pledge one side local prejudice attachment separate view parties animosities will misdirect comprehensive equal eyes watching great assemblage communities interests another foundations nation policies will laid pure immutable principles private moral preeminent free government exemplified attributes can win affecting citizens command respect world dwell prospect everincreasing satisfaction ardent love countries can inspire since truth thorough established existence economic course nature indissoluble union virtue happiness duties advantage genuine maxim honest magnanimity policies solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expect nation disregard eternal rule order rights heaven ordained since preserve sacred fire liberties destinies republican model government justice consideration perhaps finally stake experience entrusted hand american people besides object submit care will remain judgment decide far exercise 
occasion power delegated fifth articles constitution render expedient present juncture nature object urge system degree inquietude given birth instead undertake particular recommend subject guidance light derived official opportunity shall give way entire confidence discern pursuit public good assured whilst care avoid everincreasing altered might endanger benefits united effect government await future lesson experience reverence characteristic rights freemen regard public harmonious will sufficient influence deliberate question far former can impregnable fortifications latter safety advantage promote forego observe one add will proper address house representatives concern will therefore brief possible first honor called service countries every arduous struggle liberties light contemplate duties require renounce everincreasing compensation resolution instance departments still impressed produce must decline inapplicable share personal emoluments may indispensable including permanent provision executive departments must according prayer estimate station place may continue limits actual expenditures public good may thought require thus impartial sentiment awakened occasion bring us together shall take present leave without resort benign parent human race humble supplications since pleasing favor american people opportunity deliberate perfect tranquillity disposition decide unparalleled unanimity form government secure union advance happiness divine blessings may equal conspicuous enlarged view temper consultations wise measures success government must depend",
## meta = list(author = character(0), datetimestamp = list(sec = 57.3874199390411, min = 17, hour = 14, mday = 11, mon = 0, year = 123, wday = 3, yday = 10, isdst = 0), description = character(0), heading = character(0), id = character(0), language = character(0), origin = character(0))))
## list()
## list()
As the stemmed and then stem-completed output shows, the document now contains complete, readable words again. The completion is not always faithful to the original text, however: for example, “every” was stemmed to “everi” and then completed to “everincreasing”.
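To make the dictionary lookup behind stem completion more concrete, the following is a minimal base-R sketch. It is not tm's implementation: complete_stem is a hypothetical helper, the toy dictionary is invented for illustration, and the rule of picking the most frequent dictionary word that starts with the stem is an assumption chosen to resemble a frequency-based completion.

```r
# Simplified illustration of dictionary-based stem completion (base R only).
# complete_stem() is a hypothetical helper, not part of the tm package.
complete_stem <- function(stems, dictionary) {
  counts <- table(dictionary)  # how often each word occurs in the dictionary
  vapply(stems, function(s) {
    # Dictionary words that begin with the stem
    hits <- counts[startsWith(names(counts), s)]
    # No candidate found: keep the stem unchanged
    if (length(hits) == 0) return(s)
    # Otherwise pick the most frequent candidate
    names(hits)[which.max(hits)]
  }, character(1), USE.NAMES = FALSE)
}

# Toy dictionary standing in for the unstemmed corpus (assumed data)
dictionary <- c("people", "people", "peoples", "duties", "duty", "duties")
completed <- complete_stem(c("peopl", "duti", "xyz"), dictionary)
print(completed)
```

A stem with no match in the dictionary is returned unchanged here, which is one way to avoid inventing words; a stem with several candidates can still be completed to a word that differs from the original, which is the kind of drift visible in the output above.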
summary(docs)
## Length Class Mode
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
## character(0) 2 PlainTextDocument list
Be sure to use the following script once you have completed preprocessing. This tells R to treat the preprocessed documents as text documents.
docs <- tm::tm_map(docs, PlainTextDocument)
writeLines(as.character(docs[1]))
nrow(df)
## [1] 59
In the code below, we save the preprocessed documents to a separate folder, because we later reuse them to measure document similarity with another library, textreuse.
#Elnaz
for (i in seq_len(nrow(df))) { # loop over the rows of df, one speech per row
  df_i <- df[i, ]
  name <- df_i$president
  year <- df_i$year
  # Build a file name of the form "<year>-<president>.txt"
  file_name <- paste0(year, "-", name, ".txt")
  loc <- paste0("./data/pre_processed/", file_name)
  # Write the i-th preprocessed document to its own file
  writeLines(as.character(docs[[i]]), loc)
}
A document-term matrix (or its transpose, the term-document matrix) is a matrix that describes the frequency of the terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
dtm <- tm::DocumentTermMatrix(docs)
dtm
## <<DocumentTermMatrix (documents: 59, terms: 5172)>>
## Non-/sparse entries: 33346/271802
## Sparsity : 89%
## Maximal term length: 23
## Weighting : term frequency (tf)
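As a small illustration of what such a matrix holds, the sketch below builds a document-term matrix by hand in base R. The three toy documents and the names docs_toy, vocab and dtm_toy are assumptions made for this example; the report itself relies on tm::DocumentTermMatrix.

```r
# Hand-built document-term matrix for three toy documents (assumed data).
docs_toy <- c(d1 = "government people government",
              d2 = "people nation",
              d3 = "nation nation peace")

# Tokenize on single spaces and collect the sorted vocabulary
tokens <- strsplit(docs_toy, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document: rows = documents, columns = terms
dtm_toy <- t(vapply(tokens,
                    function(tk) as.integer(table(factor(tk, levels = vocab))),
                    integer(length(vocab))))
colnames(dtm_toy) <- vocab

print(dtm_toy)
# Overall sparsity: the share of zero entries in the matrix
print(mean(dtm_toy == 0))
```

The overall sparsity computed on the last line is the same kind of quantity as the "Sparsity" figure reported in the tm output, namely the proportion of cells equal to zero.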
We also store the transpose, the term-document matrix:
tdm <- tm::TermDocumentMatrix(docs)
tdm
## <<TermDocumentMatrix (terms: 5172, documents: 59)>>
## Non-/sparse entries: 33346/271802
## Sparsity : 89%
## Maximal term length: 23
## Weighting : term frequency (tf)
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 5172
ord <- order(freq)
m <- as.matrix(dtm)
dim(m)
## [1] 59 5172
Optionally, write the matrix to disk:
#write.csv(m, file="DocumentTermMatrix.csv")
We remove sparse terms using a 20% sparsity threshold. The resulting matrix keeps 39 terms and has an overall sparsity of 12%.
# Start by removing sparse terms:
dtms <- removeSparseTerms(dtm, 0.2) # Keeps only terms whose sparsity is below 20%, i.e. terms missing from fewer than 20% of the documents.
dtms
## <<DocumentTermMatrix (documents: 59, terms: 39)>>
## Non-/sparse entries: 2023/278
## Sparsity : 12%
## Maximal term length: 14
## Weighting : term frequency (tf)
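The threshold applies to each term separately, while the reported sparsity describes the whole reduced matrix, which is why the two numbers differ. The base-R sketch below illustrates this on a toy matrix; counts_toy and its values are invented for the example, and the filtering rule (keep terms whose share of zero entries is below the threshold) is a simplified stand-in for removeSparseTerms rather than its actual code.

```r
# Toy document-term matrix: 6 documents (rows) by 3 terms (columns), assumed data.
counts_toy <- matrix(c(1, 2, 1, 3, 0, 2,   # "will": missing from 1 of 6 documents
                       0, 0, 4, 0, 0, 0,   # "asylum": missing from 5 of 6 documents
                       2, 1, 1, 1, 1, 3),  # "people": present in every document
                     nrow = 6,
                     dimnames = list(paste0("doc", 1:6),
                                     c("will", "asylum", "people")))

# Per-term sparsity: share of documents in which the term does not occur
term_sparsity <- colMeans(counts_toy == 0)

# Keep only the terms whose sparsity is below the 20% threshold
kept <- counts_toy[, term_sparsity < 0.2, drop = FALSE]

print(colnames(kept))
# Overall sparsity of the reduced matrix
print(mean(kept == 0))
```

In this toy case every kept term is individually below 20% sparsity, and the reduced matrix as a whole is sparser than 0% but well under the threshold, mirroring the 20% threshold and 12% overall sparsity seen above.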
We compute the frequency of each term by summing the columns of the matrix.
freq <- colSums(as.matrix(dtm))
Least frequent:
We print the head of the frequency table. The table is sorted by increasing frequency, so the head starts at a frequency of 1, the smallest possible value, and the tail contains the most frequent words.
head(table(freq), 20)
## freq
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 1932  660  366  259  204  145  130   91   86   73   83   62   69   56   51   45
##   17   18   19   20
##   38   43   31   36
The top number is the frequency with which words appear and the bottom number reflects how many words appear that frequently.
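A tiny example may help with reading this kind of table. The vector freq_toy below is invented for illustration; it plays the role of freq, with one frequency per term.

```r
# Toy term frequencies (assumed data): one entry per term
freq_toy <- c(will = 5, people = 3, asylum = 1,
              retreat = 1, predilection = 1, nation = 3)

# Top row: a frequency value; bottom row: how many terms have that frequency
freq_table <- table(freq_toy)
print(freq_table)
```

Here three terms occur once, two terms occur three times, and one term occurs five times, so the table has the columns 1, 3 and 5 with the counts 3, 2 and 1 beneath them.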
Most frequent:
tail(table(freq), 40)
## freq
## 161 164 165 174 176 177 178 185 194 195 201 202 211 229 231 232 238 250 253 269
##   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   2   1   1   1   1
## 272 279 280 285 289 299 304 314 341 346 355 373 374 380 461 488 626 689 724 963
##   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1
Below we sort the term frequencies in decreasing order and print the 20 most frequent terms. Note that the frequencies are computed from the full matrix dtm, not from the reduced matrix dtms obtained in the subsection Remove sparse words.
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
freq |> head(20)## will government nation people can
## 963 724 689 626 488
## states great power upon must
## 461 380 374 374 373
## countries world may shall everincreasing
## 355 346 341 314 304
## constitution justice peace one rights
## 299 289 285 280 279
Below we identify all terms that appear at least 50 times and print the first 20 of them.
findFreqTerms(dtm, lowfreq=50) |> head(20)
## [1] "action" "administration" "advance" "aid"
## [5] "also" "always" "america" "american"
## [9] "among" "another" "arms" "ask"
## [13] "authority" "become" "believe" "best"
## [17] "better" "beyond" "blessings" "bring"
Another approach to perform the same task:
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## will will 963
## government government 724
## nation nation 689
## people people 626
## can can 488
## states states 461
Now we visualize the results to make them easier to interpret. Using ggplot, we draw a bar plot of the words that appear more than 200 times. The x-axis shows these words in decreasing order of frequency. They appear in their base form because the words were reduced to a base form during preprocessing.
p <- ggplot(subset(wf, freq>200), aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
p
Here we find the correlations between terms: if two words always appear together in a text, the correlation between them is 1. We use a correlation limit of 0.75:
tm::findAssocs(dtm, c("government" , "states"), corlimit=0.75)
## $government
## system
## 0.78
##
## $states
## portion constitution duties object existence ruin
## 0.83 0.80 0.80 0.80 0.79 0.79
## may
## 0.78
findAssocs(dtms, "government", corlimit=0.70) # specifying a correlation limit of 0.70
## $government
## states
## 0.75
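The association scores can be reproduced in spirit with base R. The sketch below assumes that `findAssocs` reports Pearson correlations between term columns across documents; the toy matrix and the names `toy`, `assoc`, and `strong` are hypothetical.

```r
# Hypothetical counts of three terms in five documents (rows = documents).
toy <- cbind(
  government = c(2, 0, 4, 1, 3),
  system     = c(1, 0, 2, 0, 2),
  peace      = c(0, 3, 0, 2, 1)
)

# Pearson correlation of every term with "government".
assoc <- cor(toy)[, "government"]

# Drop the trivial self-correlation.
assoc <- assoc[names(assoc) != "government"]

# Keep only the terms at or above the correlation limit.
corlimit <- 0.75
strong <- assoc[assoc >= corlimit]

print(round(strong, 2))

# Two terms whose counts move in lockstep have correlation 1.
print(cor(c(1, 2, 3), c(2, 4, 6)))
```

A correlation of 1 requires the counts of the two terms to rise and fall together across documents, which is slightly stronger than merely co-occurring.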
Plot words that occur at least 20 times.
Colorized version:
In this part we visualize the word clouds. The bigger a word appears in the word cloud, the more frequent it is. Words are also colored according to their frequency.
set.seed(142)
wordcloud::wordcloud(names(freq), freq, min.freq=20, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
## Warning in wordcloud::wordcloud(names(freq), freq, min.freq = 20, scale = c(5, :
## government could not be fit on page. It will not be plotted.
## Warning in wordcloud::wordcloud(names(freq), freq, min.freq = 20, scale = c(5, :
## people could not be fit on page. It will not be plotted.
Plot the 100 most frequent words.
We plot in the same way, so size and color have the same meaning as before.
set.seed(142)
dark2 <- brewer.pal(6, "Dark2")
wordcloud::wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)
To perform hierarchical clustering, we first compute the distances between terms using the Euclidean norm and then cluster the terms based on those distances.
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="complete") # for a different look try substituting: method="ward.D"
fit
##
## Call:
## hclust(d = d, method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 39
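The distance-then-cluster step can be tried on a toy example with base R only. The term vectors below are hypothetical (rows are terms, as in `t(dtms)`), and the names `toy_terms`, `toy_d`, `toy_fit`, and `toy_groups` are introduced only for this sketch.

```r
# Hypothetical term vectors: rows are terms, columns are documents.
toy_terms <- rbind(
  government = c(5, 6, 5),
  states     = c(5, 5, 6),
  peace      = c(0, 1, 0),
  world      = c(1, 0, 0)
)

# Pairwise Euclidean distances between terms.
toy_d <- dist(toy_terms, method = "euclidean")

# Complete linkage: the distance between two clusters is the largest
# distance between any pair of their members.
toy_fit <- hclust(toy_d, method = "complete")

# Cut the tree into two groups.
toy_groups <- cutree(toy_fit, k = 2)

print(toy_groups)
print(toy_fit$height)
```

The merge heights stored in `toy_fit$height` are the values drawn on the vertical axis of a dendrogram, so the two tight pairs merge low and the final merge sits much higher.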
Dendrograms are the plots used to visualize hierarchical clustering. The lower the line joining two terms, the more similar they are, whereas higher lines indicate a larger distance between the clusters.
plot(fit, hang=-1)
Here the red boxes show the 6 clusters:
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=6) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=6, border="red") # draw dendrogram with red borders around the 6 clusters
To perform k-means clustering, we again compute the distances between terms, this time with 3 different norms (“Euclidean”, “Manhattan”, “Maximum”), and then cluster based on them.
In what follows, there are clusplots for k-means clustering with the different norms. In a clusplot, the points are projected onto the first two principal components and each ellipse encloses one cluster. At the bottom of each plot we can see the percentage of the point variability explained by these two components.
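The percentage printed under a clusplot can be understood with a short sketch, assuming it is the share of total variance carried by the first two principal components. The data `x` is simulated and the names `pca`, `var_share`, and `two_component_share` are hypothetical.

```r
set.seed(1)

# Hypothetical data: 20 points in 4 dimensions, with most of the spread
# concentrated in the first two coordinates.
x <- cbind(
  rnorm(20, sd = 5),
  rnorm(20, sd = 3),
  rnorm(20, sd = 0.5),
  rnorm(20, sd = 0.5)
)

# Principal component analysis.
pca <- prcomp(x)

# Share of the total variance carried by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# Share explained by the two components used for a 2D projection.
two_component_share <- sum(var_share[1:2])

print(round(100 * two_component_share, 1))
```

A value close to 100% means the 2D picture loses little information; a low value means the clusters may overlap in the plot even if they are separated in the full space.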
Norm: Euclidean
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
Norm: Manhattan
d <- dist(t(dtms), method="manhattan")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
d <- dist(t(dtms), method="manhattan")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
Norm: Maximum
d <- dist(t(dtms), method="maximum")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
d <- dist(t(dtms), method="maximum")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
As the k-means results show, for most of the norms the component that best explains the document-term matrix contains the following terms: {will, power, government, states, nation, people}.
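The following sketch repeats the k-means step on a toy example. It makes explicit that when a distance object is passed to `kmeans`, each row of the distance matrix is treated as the feature vector of a term. The data and the names `toy_terms`, `toy_d`, and `toy_k` are hypothetical; `set.seed` and `nstart` are added here because k-means starts from random centers, so results can otherwise change between runs.

```r
set.seed(142)

# Hypothetical term vectors: rows are terms, columns are documents.
toy_terms <- rbind(
  government = c(5, 6, 5),
  states     = c(5, 5, 6),
  power      = c(6, 5, 5),
  peace      = c(0, 1, 0),
  world      = c(1, 0, 0),
  hope       = c(0, 0, 1)
)

# Manhattan distances between terms.
toy_d <- dist(toy_terms, method = "manhattan")

# Cluster the rows of the full distance matrix; nstart repeats the random
# initialisation and keeps the best solution.
toy_k <- kmeans(as.matrix(toy_d), centers = 2, nstart = 10)

# Terms grouped by cluster label.
print(split(names(toy_k$cluster), toy_k$cluster))
```

The cluster labels themselves (1 or 2) are arbitrary; only the grouping of terms is meaningful.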
For this section, we use the textreuse library to compute similarity scores. The library provides a set of functions that take two sets or bags of words and measure their similarity or dissimilarity. They are: jaccard_similarity, jaccard_dissimilarity, jaccard_bag_similarity, and ratio_of_matches.
For this project we use Jaccard similarity and ratio of matches. Let’s describe them and look at the results more closely.
The Jaccard similarity of two sets is computed by the function jaccard_similarity. The coefficient ranges from 0 to 1. The greater the coefficient, the more similar the two sets are to one another.
\[J(A, B) = | A \cap B | / |A \cup B|\]
The function ratio_of_matches computes the ratio between the number of items in b that are also in a and the total number of items in b. Note that this measure is directional: it quantifies how much b borrows from a, but says nothing about how much a borrows from b.
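Both measures are easy to write down in base R. The helpers `jaccard_sim` and `ratio_matches` below are hypothetical stand-ins written for this sketch, not the textreuse implementations, and the two token sets are invented.

```r
# Jaccard similarity: size of the intersection over size of the union.
jaccard_sim <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Ratio of matches: share of the items in b that also occur in a.
ratio_matches <- function(a, b) {
  sum(b %in% a) / length(b)
}

a <- c("peace", "nation", "people")
b <- c("nation", "people", "world", "hope")

print(jaccard_sim(a, b))    # symmetric: same value for (b, a)
print(ratio_matches(a, b))  # how much b borrows from a
print(ratio_matches(b, a))  # how much a borrows from b
```

The two sets share 2 of 5 distinct items, so the Jaccard similarity is 0.4, while the ratio of matches is 0.5 in one direction and 2/3 in the other. Since the corpus printed further down uses the tokenizer tokenize_ngrams, the items compared there are n-grams rather than single words, which is one plausible reason the reported scores are so small.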
We compare documents in a pairwise manner and use jaccard similarity to measure the similarity between them. After doing so, a score is calculated for each pair as shown in the results:
#loc <- "./data/texts"
#docs <- tm::VCorpus(DirSource(loc))
loc <- "./data/pre_processed/"
corpus <- TextReuseCorpus(dir=loc)
comparisons <- pairwise_compare(corpus, jaccard_similarity)
compare_df <- pairwise_candidates(comparisons)
compare_df <- as.data.frame(compare_df,
col.names = names(compare_df))
#compare_df <- compare_df[order(compare_df$score,decreasing=TRUE)]
compare_df <- compare_df[order(compare_df$score,decreasing=TRUE),]
compare_df |> head(3)
## a b score
## 38 1789-George Washington 1941-Franklin D. Roosevelt 0.007974482
## 182 1801-Thomas Jefferson 1845-James K. Polk 0.007952286
## 122 1797-John Adams 1825-John Quincy Adams 0.006680585
#Najada
corpus
## TextReuseCorpus
## Number of documents: 59
## hash_func : hash_string
## tokenizer : tokenize_ngrams
writeLines(as.character(corpus[1]))## list(`1789-George Washington` = list(content = "fellowcitizens senate house representatives among vicissitudes incident life events fill greater anxieties notification transmit order receive things day present months one hand summoned countries whose voice can never heart veneration love retreat chosen fondest predilection flattered hope immutable decisions asylum decline years— retreat render everincreasing day necessarily well dear additional habits inclination frequent interrupted health gradually waste committed time hand magnitude difficulties trust voice countries called sufficient awakened wisest experience citizens distrust scrutinize qualifications overwhelming despondence one inheritance inferior endowed nature unpracticed duties civil administration peculiar consciousness deficit conflict emotions dare avert faith studied collected duties justice appreciation everincreasing circumstances might affecting dare hope executive task much swayed grateful remembrance former instance affection sensible transcendent proof confidence fellowcitizens thence little consultations incapacity well disinclination weightiest untried care error will palliated motives mislead see appear note consequences judgment countries share partial original impressed obedience public summoned repair present station peculiar improper omit first official action fervent supplications rule universal president councils nation whose providential aid can supplications everincreasing human defects benediction may consecrate liberties happiness people united states government institutions essential purpose may enable everincreasing instrument employed administration executive success functions allotted charged tender homage great authority everincreasing public private good assured expression sentiment less fellowcitizens large less either people can bound acknowledged adore invisible hand conduct affairs men united states everincreasing steps advance 
character independence nation seem distinguished token providential agencies important revolution justice accomplished system united government tranquillity deliberate voluntarily consent manifest distinction communities events result can comparative means government established without return pious gratitude along humble anticipated future blessings past seem presage reflect arise present crisis force strong mind suppression will join trust think none influence proceed new free government can auspicious commencement articles established executive departments made duties president recommend consideration measures shall judgment necessarily expedient circumstances now meet will acquit enterprise subject reference great constitution charter assembled define power designed object attention given will consistent circumstances far congenial feel actuated substitute place recommend particular measures tribute due talents rectitude patriotism adorn character selected devised adoption honor qualifications behold surest pledge one side local prejudice attachment separate view parties animosities will misdirect comprehensive equal eyes watching great assemblage communities interests another foundations nation policies will laid pure immutable principles private moral preeminent free government exemplified attributes can win affecting citizens command respect world dwell prospect everincreasing satisfaction ardent love countries can inspire since truth thorough established existence economic course nature indissoluble union virtue happiness duties advantage genuine maxim honest magnanimity policies solid rewards public prosperity felicity since less persuaded propitious smiles heaven can never expect nation disregard eternal rule order rights heaven ordained since preserve sacred fire liberties destinies republican model government justice consideration perhaps finally stake experience entrusted hand american people besides object submit care will remain judgment decide far 
exercise occasion power delegated fifth articles constitution render expedient present juncture nature object urge system degree inquietude given birth instead undertake particular recommend subject guidance light derived official opportunity shall give way entire confidence discern pursuit public good assured whilst care avoid everincreasing altered might endanger benefits united effect government await future lesson experience reverence characteristic rights freemen regard public harmonious will sufficient influence deliberate question far former can impregnable fortifications latter safety advantage promote forego observe one add will proper address house representatives concern will therefore brief possible first honor called service countries every arduous struggle liberties light contemplate duties require renounce everincreasing compensation resolution instance departments still impressed produce must decline inapplicable share personal emoluments may indispensable including permanent provision executive departments must according prayer estimate station place may continue limits actual expenditures public good may thought require thus impartial sentiment awakened occasion bring us together shall take present leave without resort benign parent human race humble supplications since pleasing favor american people opportunity deliberate perfect tranquillity disposition decide unparalleled unanimity form government secure union advance happiness divine blessings may equal conspicuous enlarged view temper consultations wise measures success government must depend",
## tokens = NULL, hashes = c(587687171, 1400508662, -1528904364, -1176008363, 2104028628, 242661590, 58376289, -1708133020, -781087718, -201365644, 1724523801, -378243016, 311718203, -2014625511, -625873884, -1545226081, -1475521618, -1523095350, -1015023003, -1005865629, 966616660, 1811332323, 1085454582, -1991377151, 1461987856, 2109740839, -1163584676, 373666486, 636751197, -950716712, -1700284989, 474818159, -2145645941, -579825031, 1426802282, -1465868929, 814227590, -194037280, -1695522327,
## -1013519161, 1820428539, 2084747588, -929758049, 1815976130, 1013951540, -978794643, -685841600, -1880644656, 1601250174, 437599726, 1047111351, -284541230, -1300956799, 429423337, 264040155, -1151923068, 1709616450, -1142690907, -87400543, 574683610, 720630299, 313783217, 1813713933, -233312219, 431165055, -2010755526, 768567022, 1673368949, 1282301332, -1937480857, -686653338, -279787596, 283377037, -1467375042, 1413012991, 1234290454, 2032835754, -252756277, -2058506222, -567440084, 586226657,
## -1699176042, 169265343, -1822480134, -1695678832, -279195864, 993750981, -768462026, -1403486852, 1421944655, -1354910634, 227991290, 695007735, -1910589095, -1491126517, 431863399, -38769206, -486552898, 1506239656, 1410998188, -734962862, -484263926, 71679551, 76514262, -926313942, 849915209, 993045926, -1941284721, 1545847062, -772821371, 1281525178, -1210520160, -1762627151, -1067577156, -1755919354, 1745815471, 2025787534, -1278537472, 867950046, -1538943095, -2072251699, -512470495, -1791694347,
## 1246832149, -692510434, -1947597694, 1728541127, 935988054, -287721816, 226643687, 254446636, 67670196, -2131041351, -145873719, 1630845967, 564756075, -22770622, 2102408449, 269094191, 726754039, 2109902787, 2000196516, -968606306, 873033121, 1662242319, 1213235183, -2120436385, 1484757162, -150131618, -494834276, -2024279898, 1732884318, 23998209, 922976392, 109718125, -98143405, 246270223, -65502352, 1471321013, 1575051764, -295641377, -476059460, -395301027, -2115384395, -1466976344, 626261553,
## -2117459973, -1228853928, 560910765, -202627190, -834445537, -1368329272, 380808289, -551303023, 1934873975, -154147308, -548807789, 496595518, 406311671, -793724231, 475061020, 1825979089, -2130922830, -785877421, 562009042, -1293054777, 587018552, -412380624, -869481633, 1151100351, -1259833850, -1658668274, 250683501, -1918606848, 1551165118, -2524374, 1142773195, 1063547700, -1024929724, -1173259960, 213162437, -671092094, 789058892, -645935361, 2135177199, -541915347, 1330624931, 636932188,
## 1522355160, -2135804363, -230168429, 883599386, -1156793337, 1583323743, -1858568612, 1982869419, -956435664, -277103692, -1837519024, -1175788070, 790284491, -408509367, 1825223403, -287972066, -667840782, -1544161943, 1224857782, 1853608939, 1720125949, 311700065, 576419384, 991734634, -629494948, 1274161940, 66678883, 1925704464, 1106239537, -969201260, -825039992, 808922196, 2095053601, -1243049410, -355678648, 1040546274, -491865144, -1235843690, 1572990866, -308542599, 493395278, -228706489,
## -1520867345, -470997245, -1596300445, -668638245, -1325980228, -953995496, -1611395777, -2043864126, -1710258789, -1567717831, -987311879, 1971826499, -671490816, -476885362, 1387956841, -987292808, -1004514019, -1251652060, 1038015662, 1928002608, 1317937696, -947684133, 359973086, -669694508, 1319835595, 1524958889, -670631520, -1018373780, 819822116, 847788327, -791141545, 1367645726, -1661427287, 850489390, 1278023458, -1561195020, -454654744, -1636293226, -8321666, 1072888337, 668441237,
## 923382187, 1985266923, 1998388686, -1397380607, -1621794836, -2079787941, -1590416686, 1315007312, 1149535113, -1738771752, 57956568, 72475199, -906468408, -1926347209, -1011809609, -729975456, -1907346796, 1160452830, 1751444229, 418248456, 443995279, -2096730612, 1308170941, 1429547453, 384065393, -405588973, 1902955964, 202742759, -1045686688, 628119015, -65577571, -531138625, 1445846230, -1467929228, 514915678, -1671273866, 776673625, 861096816, 983589687, -1509598572, 841473833, -2101045069,
## 516653034, -1246150770, -1428572983, 1216145861, -1041044698, -1565025665, -1881814257, 625008151, -904073437, 1793649208, 83312667, -582658888, -1584155092, -468446341, -1217188361, 981568879, 1941697023, 1203087682, 342749175, -816225859, -1513153725, -1760765814, 1934718083, 208337818, 1266610744, 581901443, 1150362467, -109966971, 1474149194, -1108872385, -687968304, 989945917, -1534058343, 1685827859, 151781260, -1401449250, -308703952, -1352487516, -1603828046, 56598588, -953779480, 326503324,
## -1118882259, 207823984, -1252670205, -1932639948, 2123386273, 1079912944, 579834589, -749012868, -1424971204, 2095115152, 2130148753, -1616298589, -1332652179, -320330869, -1039787095, 266524910, 772952335, -334601154, 389547656, 770650920, 93642346, -1199739205, 1158754846, -1260818113, 24183885, -613132645, -1054033410, -1388708457, 2017121382, -1302657021, 795196204, 410921085, -110949416, 822188415, 1993626384, -1041149260, -1523651655, -1853687686, -477926326, 1304320134, 861286452, 72735592,
## -1627571465, 1134277041, -1959208081, 864590062, 574606785, -1112220517, -1428687595, -1783787317, -1559506308, -453162671, 1777136830, -2132303172, 1581990069, -985553630, -1909279106, 610408840, 1785857756, -432533824, 1177587383, -1267268366, -328129309, 1281002623, -2072664896, 13043693, 152993424, -692697901, 112910478, 650714578, 1590989334, -308107026, -1247893866, 1109463134, -1572078755, 666050437, 235306704, 734944041, 965418590, -1494777657, -237949321, -1418673972, 258125388, -436911901,
## 503703024, 1842305762, 1425240121, 1642198257, -2084056321, -2001774150, -944976157, 1413207480, -884902162, -346693713, -378132048, -1849684419, 1350893468, -191395883, 1866940704, 522201617, -339760933, 301940485, -355532480, -1608505298, -1223100962, 975572501, -1892854512, -1135873262, -2062222422, -1146554835, 572210642, 1674972393, 2020952497, 444041109, -911577406, 606304119, -1902603042, -138133708, 1211158170, 2047196414, -480494287, 711174408, -2040496739, 1914891897, 580935991, -1827202257,
## 1392888639, -535597725, 1449064779, 1699116160, 1291454555, 1101678545, 431791146, 1752833319, -628275304, 146400286, 1711093935, 1168713890, 1538253965, -1563007631, 483617290, 1007319872, 366039354, 446832742, 2097967162, 1628197313, -2106802246, 165413134, -2123652821, 2066033951, 1433508263, -640666256, -1470555294, 1326392857, -881661575, 758075731, -1605108935, -1265277357, 1985804716, -1445665171, 1433171868, -908837672, 984098224, -252399342, 1073585015, 1500369472, 739632466, -622382048,
## 1514328586, 374457072, 1799452924, -1544679791, -1417362341, 434359579, -37250300, 1785451918, 1984911491, -203390251, -808653595, -1191015509, 715728308, 32836538, 1293083443, -1074470884, -891045666, 640952152, 536579032, 1022775035, 1241950162, -2058483080, 539805337, -423345352, -2080230456, -1504138522, -1173029011, 206000844, 1947441966, 1405500420, -68476653, 203465844, 647201881, 2078758924, 1674319440, -676763924, 7601371, 938785012, -834439733, -574023142, -1860433984, 418118757, -2107767613,
## 529032408, -1607210110, 226028305, -387133259, 1465334000, -1590028556, 1363544586, -117054838, -442781939, -475029078, 358271975, -2107491768, 661235271, 1594638859, -847261, -1861551742, 1991168055, 1798201976, -856750456, 1378469421, 301557781, 294017725, -294787353, -1425679221, 906872019, -1791087308, -1991936115, 919794053, 1345727137, 276230813, 370321313, -5906563, 1932237702, 792666683, -2028606752, -713586015, -45722210, -993479184, -867350974, 1143297055, 564712667, 1387518568, 1602413238,
## -870648938, 810715895, -1317926687, 1505712556, 2136215051, 1081702434, -1875527371, -1903869155, 1616512945, -1778001838, 1948661693, -203136763, 1980877122, -641115451, 298869686, 1200122902, 324093006, -1718715723, 1331698021, -1494708, -379208348, 682867917, 885234303, -228890949, 1232542523, 1752449529, 48313893), minhashes = NULL, meta = list(file = "./data/pre_processed//1789-George Washington.txt", hash_func = "hash_string", id = "1789-George Washington", minhash_func = NULL, tokenizer = "tokenize_ngrams")))
## list(hash_func = "hash_string", tokenizer = "tokenize_ngrams")
Similarity Measure: Jaccard Similarity
Now our goal is to visualize the similarities. For this purpose, we build a 3D plot with one speech on the x-axis, the other speech on the y-axis, and the similarity score on the z-axis. To keep the plot readable, we show only the 30 most similar pairs and label each speech with its year and the president’s initials. Since these speeches were given by some of the most important and well-known American presidents, the plot remains easy to read.
Each score is shown as a ball whose color and size encode the score, so pairs with similar scores are painted in similar colors.
#Choosing only the first 30 rows because otherwise the plot becomes unreadable since there are too many points
compare_df_viz <- compare_df[1:30, ]
# Converting names to initials
compare_df_viz$a <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_viz$a ,perl = TRUE)
compare_df_viz$b <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_viz$b ,perl = TRUE)
fig <- plot_ly(compare_df_viz, x = ~a, y = ~b, z = ~score, color=~score, size=~score)
fig <- fig |> add_markers()
fig <- fig |> layout(scene = list(xaxis = list(title = 'Doc1'),
yaxis = list(title = 'Doc2'),
zaxis = list(title = 'Similarity Score')
))
fig
## Warning: `line.width` does not currently support multiple values.
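To see what the `gsub` calls do to the document names, the sketch below applies the same regular expression to two names taken from the results table; the variable names `labels` and `short` are hypothetical.

```r
# Two document names as they appear in the comparison table.
labels <- c("1789-George Washington", "1941-Franklin D. Roosevelt")

# Remove every run of non-uppercase characters that directly follows an
# uppercase letter. The lookbehind (?<=[A-Z]) needs perl = TRUE.
short <- gsub("(?<=[A-Z])[^A-Z]+", "", labels, perl = TRUE)

print(short)
```

The year and the hyphen survive because they are not preceded by an uppercase letter, so the axis labels read as year plus initials.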
Similarity Measure: Ratio of Matches
Here we use another similarity measure. The first one was based on Jaccard similarity, and this one is based on ratio of matches. The method is the same, but the results are slightly different: the similarity scores are higher.
For this reason, this time we plot the 50 most similar pairs and use a smaller ball size.
loc <- "./data/pre_processed/"
corpus <- TextReuseCorpus(dir=loc)
comparisons_rom <- pairwise_compare(corpus, ratio_of_matches)
compare_df_rom <- pairwise_candidates(comparisons_rom)
compare_df_rom <- as.data.frame(compare_df_rom,
col.names = names(compare_df_rom))
#compare_df <- compare_df[order(compare_df$score,decreasing=TRUE)]
compare_df_rom <- compare_df_rom[order(compare_df_rom$score,decreasing=TRUE),]
compare_df_rom |> head(3)
## a b score
## 38 1789-George Washington 1941-Franklin D. Roosevelt 0.01628664
## 879 1857-James Buchanan 1973-Richard Nixon 0.01589103
## 1234 1897-William McKinley 1973-Richard Nixon 0.01589103
compare_df_rom_viz <- compare_df_rom[1:50, ]
compare_df_rom_viz$a <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_rom_viz$a ,perl = TRUE)
compare_df_rom_viz$b <- gsub("(?<=[A-Z])[^A-Z]+", "", compare_df_rom_viz$b ,perl = TRUE)
fig <- plot_ly(compare_df_rom_viz, x = ~a, y = ~b, z = ~score, color=~score, size=~score)
fig <- fig |> add_markers()
fig <- fig |> layout(scene = list(xaxis = list(title = 'Doc1'),
yaxis = list(title = 'Doc2'),
zaxis = list(title = 'Similarity Score')
))
fig
## Warning: `line.width` does not currently support multiple values.
As the plot shows, based on the Jaccard similarity, the 6 most similar document pairs are the following:
The sixth item matches intuition, as both of its documents are from the same president, “George W. Bush”.
The textreuse library does not output a distance matrix. Instead, it returns a dataframe with three columns: two contain the documents’ names and the third contains their similarity score (here computed with Jaccard similarity). We therefore implemented the transformation of this dataframe into a matrix by pivoting the score dataframe in the following manner:
distance_df <- compare_df |> pivot_wider(names_from=a, values_from=score)
distance_df <- replace(distance_df, is.na(distance_df), 0)
# Keep the document names of column b as row names instead of coercing them to numbers
distance_mat <- data.matrix(distance_df[, names(distance_df) != "b"])
rownames(distance_mat) <- distance_df$b
Moreover, the library does not provide a function to compute cosine similarity. Below we implement the cosine similarity between two columns of the matrix dtms and then construct the cosine distance matrix over all of its columns. Note that the columns of dtms are terms, so the matrix printed below is indexed by terms; applying the same computation to the rows gives the distances between documents.
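The cosine computation can be checked on a toy matrix with base R. This is a minimal sketch: `toy_dtm` and the helper `cosine_sim_matrix` are hypothetical, and dense base R matrices are used in place of the sparse slam matrices of the document.

```r
# Hypothetical document-term matrix: rows are documents, columns are terms.
toy_dtm <- rbind(
  doc1 = c(nation = 2, people = 1, peace = 0),
  doc2 = c(nation = 4, people = 2, peace = 0),
  doc3 = c(nation = 0, people = 1, peace = 3)
)

# Cosine similarity between all pairs of rows:
# dot products divided by the product of the Euclidean norms.
cosine_sim_matrix <- function(m) {
  norms <- sqrt(rowSums(m^2))
  tcrossprod(m) / (norms %o% norms)
}

# Document-level similarity and distance.
doc_sim <- cosine_sim_matrix(toy_dtm)
doc_dist <- 1 - doc_sim

# Term-level similarity: the same computation on the transposed matrix.
term_sim <- cosine_sim_matrix(t(toy_dtm))

print(round(doc_dist, 3))
```

Because doc2 is exactly twice doc1, their cosine distance is 0: cosine similarity ignores document length and compares only the direction of the count vectors.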
# compute cosine similarity between two terms (the first two columns of dtms)
dtms[,1]
## <<DocumentTermMatrix (documents: 59, terms: 1)>>
## Non-/sparse entries: 51/8
## Sparsity : 14%
## Maximal term length: 6
## Weighting : term frequency (tf)
cosine_sim <- crossprod_simple_triplet_matrix(dtms[,1], dtms[,2])/sqrt(col_sums(dtms[,1]^2) * col_sums(dtms[,2]^2))
# construct cosine distance matrix
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(dtms)/(sqrt(col_sums(dtms^2) %*% t(col_sums(dtms^2))))
cosine_dist_mat
## Terms
## Terms action american called can citizens countries
## action 0.0000000 0.4512145 0.3928608 0.1914627 0.2885010 0.2354029
## american 0.4512145 0.0000000 0.3440233 0.3682255 0.3491969 0.4953837
## called 0.3928608 0.3440233 0.0000000 0.3478728 0.2943415 0.3663427
## can 0.1914627 0.3682255 0.3478728 0.0000000 0.2937652 0.2820930
## citizens 0.2885010 0.3491969 0.2943415 0.2937652 0.0000000 0.2305978
## countries 0.2354029 0.4953837 0.3663427 0.2820930 0.2305978 0.0000000
## everincreasing 0.2971282 0.3990526 0.3583390 0.3168069 0.2351152 0.1888009
## faith 0.3568184 0.4530689 0.4423469 0.3254937 0.3860277 0.3566596
## fellowcitizens 0.3460135 0.4268916 0.3006944 0.3888444 0.2743366 0.3119060
## free 0.3844362 0.5361073 0.4833159 0.3957589 0.3504378 0.3396989
## future 0.3021953 0.3589264 0.3384683 0.3110346 0.4135075 0.4095629
## good 0.3545748 0.4199530 0.2860814 0.2494620 0.3785558 0.3615394
## government 0.2525751 0.4381982 0.3453656 0.2201304 0.1994509 0.1747006
## great 0.2609207 0.4896087 0.3699088 0.2731825 0.2515469 0.2318280
## hope 0.3286828 0.3762422 0.2937968 0.2287876 0.4136878 0.3656512
## justice 0.3186538 0.3691250 0.3192817 0.2202974 0.3049572 0.2399347
## life 0.4049565 0.4155350 0.4596915 0.3979175 0.4784356 0.5087264
## make 0.2783959 0.3338100 0.3488047 0.2050114 0.3107930 0.3335780
## manifest 0.2721808 0.4398752 0.3680779 0.2599952 0.2067473 0.2188805
## may 0.2549571 0.5569706 0.3077051 0.2454736 0.2169508 0.2054708
## must 0.2850261 0.2334725 0.3119320 0.2257792 0.3554914 0.3407375
## nation 0.2890731 0.3290200 0.3313436 0.1821479 0.3267104 0.2719947
## new 0.5531409 0.2996098 0.5122666 0.3658506 0.5132780 0.6125910
## now 0.3405962 0.3367251 0.4490085 0.2702416 0.4006965 0.3930551
## one 0.2374208 0.3525094 0.3663620 0.1909680 0.2015558 0.2444644
## peace 0.3972139 0.5403116 0.5058369 0.2774430 0.4913154 0.3538622
## people 0.2410530 0.3242796 0.3494222 0.1785456 0.1869730 0.2233377
## place 0.3491567 0.4476874 0.2489438 0.2873019 0.2401656 0.3928137
## power 0.2950354 0.6203970 0.3935090 0.3753447 0.1903525 0.2766009
## purpose 0.3696441 0.5140156 0.4194362 0.3612525 0.3286782 0.4091306
## rights 0.1842631 0.5352416 0.4215250 0.1840578 0.2638157 0.1835227
## secure 0.2984811 0.4466394 0.4369011 0.2436991 0.3035170 0.2182761
## shall 0.3460512 0.5735407 0.4031659 0.3109780 0.3987844 0.3592415
## states 0.3327275 0.6482478 0.4557974 0.3434238 0.2502237 0.2284206
## time 0.2352651 0.2445026 0.2824927 0.2204211 0.2385485 0.3153410
## united 0.3251816 0.5217468 0.4375936 0.3284096 0.2826144 0.2114620
## will 0.2372428 0.2107531 0.2879182 0.1866537 0.2775627 0.2463389
## without 0.2963509 0.5419869 0.3300456 0.3174816 0.3123329 0.2459404
## world 0.4817611 0.3071941 0.3940953 0.3106657 0.5278569 0.5463383
## Terms
## Terms everincreasing faith fellowcitizens free future
## action 0.2971282 0.3568184 0.3460135 0.3844362 0.3021953
## american 0.3990526 0.4530689 0.4268916 0.5361073 0.3589264
## called 0.3583390 0.4423469 0.3006944 0.4833159 0.3384683
## can 0.3168069 0.3254937 0.3888444 0.3957589 0.3110346
## citizens 0.2351152 0.3860277 0.2743366 0.3504378 0.4135075
## countries 0.1888009 0.3566596 0.3119060 0.3396989 0.4095629
## everincreasing 0.0000000 0.3467121 0.2427377 0.4095602 0.2923592
## faith 0.3467121 0.0000000 0.4644046 0.2628773 0.3053668
## fellowcitizens 0.2427377 0.4644046 0.0000000 0.4560197 0.3931916
## free 0.4095602 0.2628773 0.4560197 0.0000000 0.4122646
## future 0.2923592 0.3053668 0.3931916 0.4122646 0.0000000
## good 0.3489480 0.3992667 0.3590024 0.3976671 0.3031706
## government 0.2038159 0.3688935 0.2876922 0.3795765 0.3821522
## great 0.1937509 0.3854978 0.2755640 0.4094539 0.3243857
## hope 0.3480163 0.3070473 0.4859727 0.3520430 0.2388759
## justice 0.2201156 0.3163116 0.3480742 0.4156400 0.3005773
## life 0.4293273 0.3590834 0.5446968 0.3650165 0.3979645
## make 0.3631464 0.3768318 0.4385316 0.3950963 0.3257788
## manifest 0.2869856 0.4041606 0.4064413 0.4414995 0.4171403
## may 0.2485394 0.4636186 0.2479175 0.3691530 0.4220194
## must 0.3402277 0.3610163 0.4611040 0.3917937 0.2996883
## nation 0.2570508 0.2458007 0.4009425 0.3768906 0.2068874
## new 0.4450541 0.5064848 0.5202311 0.5644522 0.3712807
## now 0.3251475 0.4073692 0.3887862 0.4604766 0.3613852
## one 0.2820908 0.4550849 0.2705911 0.3994467 0.3747356
## peace 0.4595935 0.2712644 0.4948846 0.3758068 0.3654798
## people 0.2322754 0.2652262 0.3353036 0.2607631 0.3366070
## place 0.3083978 0.4320127 0.2600338 0.4804085 0.2962116
## power 0.3350926 0.5898255 0.2604168 0.4132375 0.4913846
## purpose 0.3815075 0.3535583 0.5114757 0.4384033 0.4359791
## rights 0.2515484 0.3683577 0.3437515 0.3572963 0.3149498
## secure 0.3290915 0.3212865 0.4415170 0.3377984 0.3223358
## shall 0.3572345 0.3406215 0.4262117 0.3541365 0.4197767
## states 0.2709801 0.4823813 0.2853556 0.3797843 0.4980472
## time 0.2231006 0.3318286 0.2260997 0.3601695 0.1644345
## united 0.2298496 0.3770863 0.2882830 0.3042528 0.4195718
## will 0.2012852 0.3264156 0.3319261 0.3822256 0.2125976
## without 0.2063253 0.3829404 0.3118560 0.3926966 0.3875567
## world 0.5269638 0.3859550 0.5531230 0.4178739 0.3650131
## Terms
## Terms good government great hope justice life
## action 0.3545748 0.2525751 0.2609207 0.3286828 0.3186538 0.4049565
## american 0.4199530 0.4381982 0.4896087 0.3762422 0.3691250 0.4155350
## called 0.2860814 0.3453656 0.3699088 0.2937968 0.3192817 0.4596915
## can 0.2494620 0.2201304 0.2731825 0.2287876 0.2202974 0.3979175
## citizens 0.3785558 0.1994509 0.2515469 0.4136878 0.3049572 0.4784356
## countries 0.3615394 0.1747006 0.2318280 0.3656512 0.2399347 0.5087264
## everincreasing 0.3489480 0.2038159 0.1937509 0.3480163 0.2201156 0.4293273
## faith 0.3992667 0.3688935 0.3854978 0.3070473 0.3163116 0.3590834
## fellowcitizens 0.3590024 0.2876922 0.2755640 0.4859727 0.3480742 0.5446968
## free 0.3976671 0.3795765 0.4094539 0.3520430 0.4156400 0.3650165
## future 0.3031706 0.3821522 0.3243857 0.2388759 0.3005773 0.3979645
## good 0.0000000 0.3201950 0.3083112 0.3214639 0.3017595 0.3510038
## government 0.3201950 0.0000000 0.2123451 0.3647836 0.2539866 0.4745677
## great 0.3083112 0.2123451 0.0000000 0.3981090 0.2378097 0.4473356
## hope 0.3214639 0.3647836 0.3981090 0.0000000 0.2564148 0.3399784
## justice 0.3017595 0.2539866 0.2378097 0.2564148 0.0000000 0.3127039
## life 0.3510038 0.4745677 0.4473356 0.3399784 0.3127039 0.0000000
## make 0.2873995 0.2739569 0.2939698 0.2333408 0.3361181 0.3691802
## manifest 0.3308990 0.2048855 0.2196758 0.4000634 0.2093755 0.4315715
## may 0.3178149 0.1706117 0.2024905 0.3669758 0.2842671 0.5787241
## must 0.4144707 0.3108007 0.3710546 0.2327013 0.2590383 0.3624059
## nation 0.2307684 0.2566272 0.2457434 0.2339198 0.1285089 0.2459278
## new 0.4346620 0.4996151 0.4733851 0.3988559 0.4009755 0.3771771
## now 0.3565571 0.3111200 0.3184733 0.4028245 0.3937696 0.4869662
## one 0.3658407 0.1775551 0.2911584 0.3646145 0.3517557 0.5381730
## peace 0.4039819 0.3537856 0.3367790 0.3350077 0.2381164 0.3600799
## people 0.2613979 0.1517226 0.2037242 0.3037056 0.2151865 0.3534008
## place 0.2777006 0.2853730 0.2238508 0.3845113 0.2891575 0.4903024
## power 0.4245210 0.2204875 0.2720766 0.5515264 0.4437583 0.6447565
## purpose 0.3854439 0.3293051 0.2953225 0.4350141 0.2714008 0.3667078
## rights 0.3319872 0.1557096 0.2282462 0.3197530 0.2528017 0.4938392
## secure 0.3973605 0.2483849 0.2744137 0.2487225 0.2700347 0.4486339
## shall 0.3215939 0.2641309 0.3084373 0.4048258 0.3342515 0.4896929
## states 0.4097998 0.1479527 0.2124300 0.5428521 0.3229409 0.6545667
## time 0.2969035 0.2475803 0.2337405 0.2362141 0.2571278 0.3884673
## united 0.3491342 0.2031798 0.1862040 0.4911462 0.3086215 0.4928024
## will 0.2150069 0.2223562 0.2117254 0.2525104 0.1947787 0.3868693
## without 0.2865643 0.2612675 0.2046467 0.3480220 0.2237040 0.4439553
## world 0.4397523 0.5139663 0.4973628 0.2823261 0.3588431 0.3173307
## Terms
## Terms make manifest may must nation new
## action 0.2783959 0.2721808 0.2549571 0.2850261 0.2890731 0.5531409
## american 0.3338100 0.4398752 0.5569706 0.2334725 0.3290200 0.2996098
## called 0.3488047 0.3680779 0.3077051 0.3119320 0.3313436 0.5122666
## can 0.2050114 0.2599952 0.2454736 0.2257792 0.1821479 0.3658506
## citizens 0.3107930 0.2067473 0.2169508 0.3554914 0.3267104 0.5132780
## countries 0.3335780 0.2188805 0.2054708 0.3407375 0.2719947 0.6125910
## everincreasing 0.3631464 0.2869856 0.2485394 0.3402277 0.2570508 0.4450541
## faith 0.3768318 0.4041606 0.4636186 0.3610163 0.2458007 0.5064848
## fellowcitizens 0.4385316 0.4064413 0.2479175 0.4611040 0.4009425 0.5202311
## free 0.3950963 0.4414995 0.3691530 0.3917937 0.3768906 0.5644522
## future 0.3257788 0.4171403 0.4220194 0.2996883 0.2068874 0.3712807
## good 0.2873995 0.3308990 0.3178149 0.4144707 0.2307684 0.4346620
## government 0.2739569 0.2048855 0.1706117 0.3108007 0.2566272 0.4996151
## great 0.2939698 0.2196758 0.2024905 0.3710546 0.2457434 0.4733851
## hope 0.2333408 0.4000634 0.3669758 0.2327013 0.2339198 0.3988559
## justice 0.3361181 0.2093755 0.2842671 0.2590383 0.1285089 0.4009755
## life 0.3691802 0.4315715 0.5787241 0.3624059 0.2459278 0.3771771
## make 0.0000000 0.3431624 0.3398646 0.2531402 0.2976471 0.3380555
## manifest 0.3431624 0.0000000 0.2628802 0.3584185 0.2491514 0.5421030
## may 0.3398646 0.2628802 0.0000000 0.3689675 0.3277573 0.6337560
## must 0.2531402 0.3584185 0.3689675 0.0000000 0.2558214 0.3379381
## nation 0.2976471 0.2491514 0.3277573 0.2558214 0.0000000 0.3028733
## new 0.3380555 0.5421030 0.6337560 0.3379381 0.3028733 0.0000000
## now 0.3363157 0.4029997 0.3803697 0.3446933 0.2833724 0.4120310
## one 0.2772522 0.2803198 0.2330540 0.3249503 0.3580562 0.4827211
## peace 0.3413758 0.3548470 0.4596216 0.3619569 0.2063107 0.4010392
## people 0.2685066 0.2189424 0.2016418 0.2610099 0.1722572 0.4228014
## place 0.3700452 0.3073992 0.2987367 0.4075232 0.2747317 0.4570153
## power 0.4641451 0.2834444 0.1677866 0.5589987 0.4785666 0.7006705
## purpose 0.4312651 0.2790093 0.3818978 0.3834064 0.2721801 0.5525744
## rights 0.3209995 0.2166184 0.1960175 0.3539563 0.2607443 0.5542660
## secure 0.2132659 0.2832253 0.2573536 0.2724153 0.2595902 0.5509508
## shall 0.3495582 0.3279675 0.2220753 0.4286209 0.3320574 0.5808853
## states 0.4247068 0.2117812 0.1307284 0.4977840 0.3806853 0.6808303
## time 0.2380485 0.3394911 0.3090275 0.2445813 0.1920628 0.2580735
## united 0.3809815 0.2877599 0.2488076 0.4266492 0.2725213 0.6102369
## will 0.1688238 0.2177655 0.2951557 0.2096607 0.1654258 0.3055922
## without 0.3499942 0.2533404 0.1493131 0.4055130 0.2806816 0.6078607
## world 0.3554393 0.5324166 0.5989574 0.2280928 0.2502530 0.2142673
## Terms
## Terms now one peace people place power
## action 0.3405962 0.2374208 0.3972139 0.2410530 0.3491567 0.2950354
## american 0.3367251 0.3525094 0.5403116 0.3242796 0.4476874 0.6203970
## called 0.4490085 0.3663620 0.5058369 0.3494222 0.2489438 0.3935090
## can 0.2702416 0.1909680 0.2774430 0.1785456 0.2873019 0.3753447
## citizens 0.4006965 0.2015558 0.4913154 0.1869730 0.2401656 0.1903525
## countries 0.3930551 0.2444644 0.3538622 0.2233377 0.3928137 0.2766009
## everincreasing 0.3251475 0.2820908 0.4595935 0.2322754 0.3083978 0.3350926
## faith 0.4073692 0.4550849 0.2712644 0.2652262 0.4320127 0.5898255
## fellowcitizens 0.3887862 0.2705911 0.4948846 0.3353036 0.2600338 0.2604168
## free 0.4604766 0.3994467 0.3758068 0.2607631 0.4804085 0.4132375
## future 0.3613852 0.3747356 0.3654798 0.3366070 0.2962116 0.4913846
## good 0.3565571 0.3658407 0.4039819 0.2613979 0.2777006 0.4245210
## government 0.3111200 0.1775551 0.3537856 0.1517226 0.2853730 0.2204875
## great 0.3184733 0.2911584 0.3367790 0.2037242 0.2238508 0.2720766
## hope 0.4028245 0.3646145 0.3350077 0.3037056 0.3845113 0.5515264
## justice 0.3937696 0.3517557 0.2381164 0.2151865 0.2891575 0.4437583
## life 0.4869662 0.5381730 0.3600799 0.3534008 0.4903024 0.6447565
## make 0.3363157 0.2772522 0.3413758 0.2685066 0.3700452 0.4641451
## manifest 0.4029997 0.2803198 0.3548470 0.2189424 0.3073992 0.2834444
## may 0.3803697 0.2330540 0.4596216 0.2016418 0.2987367 0.1677866
## must 0.3446933 0.3249503 0.3619569 0.2610099 0.4075232 0.5589987
## nation 0.2833724 0.3580562 0.2063107 0.1722572 0.2747317 0.4785666
## new 0.4120310 0.4827211 0.4010392 0.4228014 0.4570153 0.7006705
## now 0.0000000 0.3077061 0.4274371 0.2405033 0.3932803 0.5083115
## one 0.3077061 0.0000000 0.4508339 0.2563762 0.3430625 0.2260662
## peace 0.4274371 0.4508339 0.0000000 0.3037301 0.4391556 0.5824416
## people 0.2405033 0.2563762 0.3037301 0.0000000 0.2836519 0.2901596
## place 0.3932803 0.3430625 0.4391556 0.2836519 0.0000000 0.3156333
## power 0.5083115 0.2260662 0.5824416 0.2901596 0.3156333 0.0000000
## purpose 0.3717439 0.4842810 0.3319412 0.2642954 0.3717797 0.4431354
## rights 0.3219160 0.1857221 0.3409555 0.2246640 0.3022974 0.2331547
## secure 0.3697238 0.3197341 0.3283917 0.2497686 0.4010950 0.4233339
## shall 0.3135053 0.4191976 0.3783925 0.2386692 0.4091058 0.4111691
## states 0.3716212 0.2728217 0.4648671 0.2404830 0.3288980 0.1773186
## time 0.2116518 0.2403845 0.3355499 0.2144179 0.2347682 0.3826398
## united 0.2985044 0.3207590 0.3627736 0.2487462 0.3126846 0.2896242
## will 0.1806064 0.2286200 0.3419625 0.1902025 0.2864015 0.4248998
## without 0.2953097 0.3574022 0.4460921 0.2197841 0.3225104 0.3050253
## world 0.4448365 0.4664292 0.2274803 0.3595323 0.4343980 0.7214903
## Terms
## Terms purpose rights secure shall states time
## action 0.3696441 0.1842631 0.2984811 0.3460512 0.3327275 0.2352651
## american 0.5140156 0.5352416 0.4466394 0.5735407 0.6482478 0.2445026
## called 0.4194362 0.4215250 0.4369011 0.4031659 0.4557974 0.2824927
## can 0.3612525 0.1840578 0.2436991 0.3109780 0.3434238 0.2204211
## citizens 0.3286782 0.2638157 0.3035170 0.3987844 0.2502237 0.2385485
## countries 0.4091306 0.1835227 0.2182761 0.3592415 0.2284206 0.3153410
## everincreasing 0.3815075 0.2515484 0.3290915 0.3572345 0.2709801 0.2231006
## faith 0.3535583 0.3683577 0.3212865 0.3406215 0.4823813 0.3318286
## fellowcitizens 0.5114757 0.3437515 0.4415170 0.4262117 0.2853556 0.2260997
## free 0.4384033 0.3572963 0.3377984 0.3541365 0.3797843 0.3601695
## future 0.4359791 0.3149498 0.3223358 0.4197767 0.4980472 0.1644345
## good 0.3854439 0.3319872 0.3973605 0.3215939 0.4097998 0.2969035
## government 0.3293051 0.1557096 0.2483849 0.2641309 0.1479527 0.2475803
## great 0.2953225 0.2282462 0.2744137 0.3084373 0.2124300 0.2337405
## hope 0.4350141 0.3197530 0.2487225 0.4048258 0.5428521 0.2362141
## justice 0.2714008 0.2528017 0.2700347 0.3342515 0.3229409 0.2571278
## life 0.3667078 0.4938392 0.4486339 0.4896929 0.6545667 0.3884673
## make 0.4312651 0.3209995 0.2132659 0.3495582 0.4247068 0.2380485
## manifest 0.2790093 0.2166184 0.2832253 0.3279675 0.2117812 0.3394911
## may 0.3818978 0.1960175 0.2573536 0.2220753 0.1307284 0.3090275
## must 0.3834064 0.3539563 0.2724153 0.4286209 0.4977840 0.2445813
## nation 0.2721801 0.2607443 0.2595902 0.3320574 0.3806853 0.1920628
## new 0.5525744 0.5542660 0.5509508 0.5808853 0.6808303 0.2580735
## now 0.3717439 0.3219160 0.3697238 0.3135053 0.3716212 0.2116518
## one 0.4842810 0.1857221 0.3197341 0.4191976 0.2728217 0.2403845
## peace 0.3319412 0.3409555 0.3283917 0.3783925 0.4648671 0.3355499
## people 0.2642954 0.2246640 0.2497686 0.2386692 0.2404830 0.2144179
## place 0.3717797 0.3022974 0.4010950 0.4091058 0.3288980 0.2347682
## power 0.4431354 0.2331547 0.4233339 0.4111691 0.1773186 0.3826398
## purpose 0.0000000 0.3718871 0.4045233 0.3388363 0.3858791 0.3887969
## rights 0.3718871 0.0000000 0.2228500 0.2444715 0.1672523 0.2836233
## secure 0.4045233 0.2228500 0.0000000 0.3287758 0.3129164 0.3023304
## shall 0.3388363 0.2444715 0.3287758 0.0000000 0.2378221 0.3764621
## states 0.3858791 0.1672523 0.3129164 0.2378221 0.0000000 0.3583509
## time 0.3887969 0.2836233 0.3023304 0.3764621 0.3583509 0.0000000
## united 0.2918401 0.2452104 0.2763979 0.3266657 0.1663374 0.3135349
## will 0.3681593 0.2249573 0.2435735 0.2853986 0.3194047 0.1662107
## without 0.3183599 0.2761227 0.2710091 0.2536850 0.2140895 0.3415666
## world 0.5010118 0.5393091 0.4748047 0.5614315 0.7254198 0.3115638
## Terms
## Terms united will without world
## action 0.3251816 0.2372428 0.2963509 0.4817611
## american 0.5217468 0.2107531 0.5419869 0.3071941
## called 0.4375936 0.2879182 0.3300456 0.3940953
## can 0.3284096 0.1866537 0.3174816 0.3106657
## citizens 0.2826144 0.2775627 0.3123329 0.5278569
## countries 0.2114620 0.2463389 0.2459404 0.5463383
## everincreasing 0.2298496 0.2012852 0.2063253 0.5269638
## faith 0.3770863 0.3264156 0.3829404 0.3859550
## fellowcitizens 0.2882830 0.3319261 0.3118560 0.5531230
## free 0.3042528 0.3822256 0.3926966 0.4178739
## future 0.4195718 0.2125976 0.3875567 0.3650131
## good 0.3491342 0.2150069 0.2865643 0.4397523
## government 0.2031798 0.2223562 0.2612675 0.5139663
## great 0.1862040 0.2117254 0.2046467 0.4973628
## hope 0.4911462 0.2525104 0.3480220 0.2823261
## justice 0.3086215 0.1947787 0.2237040 0.3588431
## life 0.4928024 0.3868693 0.4439553 0.3173307
## make 0.3809815 0.1688238 0.3499942 0.3554393
## manifest 0.2877599 0.2177655 0.2533404 0.5324166
## may 0.2488076 0.2951557 0.1493131 0.5989574
## must 0.4266492 0.2096607 0.4055130 0.2280928
## nation 0.2725213 0.1654258 0.2806816 0.2502530
## new 0.6102369 0.3055922 0.6078607 0.2142673
## now 0.2985044 0.1806064 0.2953097 0.4448365
## one 0.3207590 0.2286200 0.3574022 0.4664292
## peace 0.3627736 0.3419625 0.4460921 0.2274803
## people 0.2487462 0.1902025 0.2197841 0.3595323
## place 0.3126846 0.2864015 0.3225104 0.4343980
## power 0.2896242 0.4248998 0.3050253 0.7214903
## purpose 0.2918401 0.3681593 0.3183599 0.5010118
## rights 0.2452104 0.2249573 0.2761227 0.5393091
## secure 0.2763979 0.2435735 0.2710091 0.4748047
## shall 0.3266657 0.2853986 0.2536850 0.5614315
## states 0.1663374 0.3194047 0.2140895 0.7254198
## time 0.3135349 0.1662107 0.3415666 0.3115638
## united 0.0000000 0.2757858 0.2547703 0.5568450
## will 0.2757858 0.0000000 0.2448891 0.3525221
## without 0.2547703 0.2448891 0.0000000 0.5924759
## world 0.5568450 0.3525221 0.5924759 0.0000000
First, a note on how we interpret the most frequent terms. As shown above, the most frequent word was “will”. Drawing on our general knowledge as well, we believe this word is characteristic of political speeches: politicians make promises, and promises are usually phrased in the future tense. We also notice that words such as “govern” (the stem of “government”), “state” and “nation” are used frequently. These are typical words for a head of state, so we expected them to appear often.
Second, for document similarity we used two measures: Jaccard similarity and ratio of matches. Both produced low scores, but at least for the top-ranked pairs the ratio-of-matches scores are twice as high as the Jaccard scores.
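To make the difference between the two measures concrete, here is a minimal Python sketch of both, written from their usual definitions rather than taken from the “textreuse” source. The two toy sentences and the word-level tokenization are assumptions for illustration only; the analysis above may tokenize differently (for example into n-grams). Jaccard similarity divides the shared distinct tokens by all distinct tokens in both documents, while ratio of matches divides only by the tokens of one document, which is one plausible reason its scores come out higher.

```python
def jaccard_similarity(a, b):
    """Shared distinct tokens over all distinct tokens: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty documents: define similarity as 0
    return len(set_a & set_b) / len(union)


def ratio_of_matches(a, b):
    """Fraction of the tokens in b that also occur in a (directional)."""
    if not b:
        return 0.0
    set_a = set(a)
    return sum(token in set_a for token in b) / len(b)


# Hypothetical toy documents, not taken from the speeches dataset.
doc_a = "the people of the united states".split()
doc_b = "the united states will keep the peace".split()

print(jaccard_similarity(doc_a, doc_b))  # 3 shared / 8 distinct = 0.375
print(ratio_of_matches(doc_a, doc_b))    # 4 of the 7 tokens in doc_b occur in doc_a
print(ratio_of_matches(doc_b, doc_a))    # 4 of the 6 tokens in doc_a occur in doc_b
```

Note that ratio of matches is directional: swapping the two documents changes the denominator and therefore the score, whereas Jaccard similarity is symmetric.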